BAYESIAN ANALYSIS OF SOFTWARE COST AND QUALITY MODELS


BAYESIAN ANALYSIS OF SOFTWARE COST AND QUALITY MODELS

by

Sunita Devnani-Chulani

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 1999

Copyright 1999 Sunita Devnani-Chulani

This dissertation is dedicated to Jai Chulani.

If you press me to say why I loved him, I can say no more than it was because he was he, and I was I.
- Montaigne

ACKNOWLEDGEMENTS

Although I have dedicated this dissertation to my husband, Jai, I would once again like to acknowledge that without his complete support and continuous encouragement, it would be impossible for me to get this far. Jai - you have made my life so beautiful!

I would also like to thank my family members - my parents (the Devnanis and the Chulanis), my grandmother, my siblings (Sanjay and Kamna, Kavita and Haresh, Varsha and Nishit, and Varkha) and my nieces and nephews (Kunal, Priyanka, Jay, Krish and Yash) for their love and support.

I am also deeply grateful to my Ph.D. advisor, Dr. Barry Boehm, for being such a great mentor. I really appreciate all the valuable advice he has provided me in the last four years and I look forward to continuing to learn from him in the many years ahead. My thanks also go to the following key people: Dr. Bert Steece, for his deep interest in my Ph.D. work - Dr. Steece helped me come up with many good ideas and suggestions to evolve them; and Dr. Peter Hantos, who has given me valuable advice whenever I've needed it and has given me a very good perspective of the world outside of academia. I would also like to thank all my dear friends and the COCOMO research group members. I will miss them all a great deal.

And last but not least, I would like to acknowledge my grandparents who are no longer with us. As a child, I spent most of my weekends with them. During the day, my grandfather would teach me Math and Science (he was a well-known teacher and it was his tutoring in the early years that developed my keen interest in numbers). And in the evening, my grandmother would cook great food and tell me amusing bed-time stories. I miss them both very dearly.

Table of Contents

1. Introduction
2. Existing Software Estimation Models and Techniques
   Introduction
   Model-Based Techniques
      The Putnam Software Life Cycle Model (SLIM)
      The Jensen Model
      The Bailey-Basili Model
      Checkpoint
      PRICE-S
      Estimacs
      SEER-SEM
      Softcost
      COCOMO II
      Software Reliability Growth Models
      Summary of Model-Based Techniques
   Expertise-Based Techniques
      The Delphi Approach
      Rule-Based Systems
      Work Breakdown Structures (WBS)
   Learning-Oriented Techniques
      Neural Networks
      Case-Based Reasoning (CBR)
   Dynamics-Based Techniques
   Regression-Based Techniques
      Standard Regression - Ordinary Least Squares Method
      Robust Regression
   Composite Techniques
      The Bayesian Approach
   Conclusions on Existing Software Estimation Techniques
3. The Research Approach and Framework
   Introduction
   The Modeling Methodology
   The Bayesian Approach
   A Simple Software Cost Estimation Model
   The COnstructive QUALity MOdel (COQUALMO) Framework
   The Software Defect Introduction and Removal Model
4. Research Contributions
   Introduction
   COCOMO II Calibration
      The Multiple Regression Calibration Approach
      The Bayesian Calibration Approach
      The Generalized G-Prior Approach
      The Reduced Model
      Conclusions on COCOMO II Calibration
   COnstructive QUALity MOdel (COQUALMO)
      The Software Defect Introduction (DI) Sub-Model
      The Software Defect Removal (DR) Sub-Model
      Proposed DI Sub-Model to DR Sub-Model Rosetta-Stone
      An Independent Validation Study
      COQUALMO Integrated with COCOMO II
      Conclusions on COQUALMO
5. Summary of Contributions and Future Research Directions
   Introduction
   Summary of Contributions
   Future Research Directions
Bibliography
Appendices
   A. COCOMO II and COQUALMO Delphi Results
   B. COCOMO II and COQUALMO Cost Estimation Questionnaire
   C. Summary of COCOMO II Data
   D. COCOMO II.1999 Prior, Sampling, Posterior Bayesian Means and Variances

List of Figures

2.1: Software Estimation Techniques Classification
2.2: The Rayleigh Model
2.3: The Trachtenburg Reliability Rayleigh Curve
2.4: SPR's Summary of Opportunities
2.5: Rubin's Map of Relationships of Estimation Dimensions
2.6: SEER-SEM Inputs and Outputs
2.7: Hazard function proposed by the Jelinski-Moranda Reliability Growth Model
2.8: A Product Work Breakdown Structure
2.9: An Activity Work Breakdown Structure
2.10: A Neural Network Estimation Model
3.1: The Seven-Step Modeling Methodology
3.2: A Simple Software Cost Model
3.3: Prior Density Functions
3.4: Post Sample Density Functions: Modeling under complete prior uncertainty
3.5: Prior density function of B
3.6: Post Sample Density Functions (Modeling with the Inclusion of Prior Information)
3.7: The Bayesian Approach
3.8: The Software Defect Introduction and Removal Model
3.9: Defect Injection and Removal of a Process Step [Kan, 1996]
4.1: Distribution of RUSE
4.2: Example of the 10% weighted average approach: RUSE Rating Scale
4.3: Distribution of Effort and Size: 1999 dataset of 161 observations
4.4: Distribution of log transformed Effort and Size: 1999 dataset of 161 observations
4.5: Correlation between log[Effort] and log[Size]
4.6: A-Posteriori Bayesian Update in the Presence of Noisy Data (Develop for Reuse, RUSE)
4.7: Bayesian A-Posteriori Productivity Ranges
4.8: A-Posteriori Generalized g-prior Update in the Presence of Noisy Data (AEXP, Applications Experience)
4.9: The Defect Introduction Sub-Model of COQUALMO
4.10: Coding Defect Introduction Ranges
4.11: The Defect Removal Sub-Model of COQUALMO
4.12: Reported Defect Densities [Grady, 1997]
4.13: DI Sub-Model to DR Sub-Model Rosetta-Stone
4.14: COQUALMO Integrated with COCOMO II

List of Tables

2.1: Pros and Cons of Expertise-Based Techniques
2.2: Wideband Delphi Approach
2.3: Prediction Accuracy of COCOMO II.1997 vs. COCOMO II
3.1: Step 1 - Factors affecting Cost and Quality
3.2: Data Reporting Scheme
4.1: COCOMO II Post Architecture Parameters
4.2: COCOMO II.1997 A-Priori Values
4.3: COCOMO II.1997 Highly Correlated Parameters
4.4a: Develop for Reuse (RUSE) Expert-determined a-priori rating scale
4.4b: Develop for Reuse (RUSE) Data-determined rating scale
4.5: COCOMO II.1997 Values
4.6: Prediction Accuracy of COCOMO II
4.7: COCOMO II.1999 A-Priori Rating Scale for Develop for Reuse (RUSE)
4.8: Delphi-Determined COCOMO II.1999 "A-Priori" Values
4.9: COCOMO II.1999 Values
4.10: Prediction Accuracies of COCOMO II.1997, A-Priori COCOMO II.1999 and Bayesian A-Posteriori COCOMO II.1999 Before and After Stratification
4.11: 10% Weighted-average Regression Values on COCOMO II.1999 Dataset
4.12: Prediction Accuracies Using the 10% Weighted-Average Multiple-Regression Approach and the Bayesian Approach on the 1999 Dataset of 161 Datapoints
4.13: Prediction Accuracies Using the Pure-Regression, the 10% Weighted-Average Multiple-Regression Approach and the Bayesian Based Models Calibrated Using the 1997 Dataset of 83 Datapoints and Validated Against 83 and 161 Datapoints
4.14: Multiple-Regression and Generalized G-prior Estimates
4.15: PRED(.30) Using 10% Weighted-Average Multiple Regression, Bayesian Approach and G-Prior Approaches on the 1999 Dataset of 161 Datapoints
4.16: Prediction Accuracies of Reduced COCOMO II
4.17: Defect Introduction Drivers
4.18: Programmer Capability (PCAP) Differences in Defect Introduction
4.19: Initial Data Analysis on the DI Model
4.20: The Defect Removal Profiles
4.21: Results of 2-Round Delphi for Defect Removal Fractions for Automated Analysis
4.22: Defect Density Results from Initial DRF Values
4.23: DI Sub-Model to DR Sub-Model Rosetta-Stone
4.24: Project A Characteristics
4.25: Project A Defect Introduction Rates
4.26: Project A Defect Removal Profiles
4.27: Project A Residual Defect Density

Abstract

Software cost and quality estimation has become an increasingly important field due to the increasingly pervasive role of software in today's world. In spite of the existence of about a dozen software estimation models, the field continues to remain not too well understood, causing growing concerns in the software-engineering community.

In this dissertation, the existing techniques that are used for building software estimation models are discussed, with a focus on the empirical calibration of the models. It is noted that traditional calibration approaches (especially the popular multiple-regression approach) can have serious difficulties when used on software engineering data that is usually scarce, incomplete, and imprecisely collected. To alleviate these problems, a composite technique for building software models based on a mix of data and expert judgement is discussed. This technique is based on the well-understood and widely accepted Bayes theorem, which has been successfully applied in other engineering domains, including to some extent the software-reliability engineering domain. But the Bayesian approach has not been effectively exploited for building more robust software estimation models that use a variance-balanced mix of project data and expert judgement.

The focus of this dissertation is to show the improvement in accuracy of the cost estimation model (COCOMO II) when the Bayesian approach is employed versus the multiple regression approach. When the Bayesian model calibrated using a dataset of 83 datapoints is validated on a dataset of 161 datapoints (all datapoints are actual completed software projects collected from Commercial, Aerospace, Government and non-profit organizations), it yields a prediction accuracy of PRED(.30) = 66% (i.e. 106, or 66%, of the 161 datapoints are estimated within 30% of the actuals), whereas the pure-regression based model calibrated using 83 datapoints, when validated on the same 161-project dataset, yields a poorer accuracy of PRED(.30) = 44%.

A quality model extension of the COCOMO II model, namely COQUALMO, is also discussed. The development of COQUALMO from its onset shows how a comprehensive modeling methodology can be used to build effective software estimation models using the Bayesian framework elaborated in this dissertation.

CHAPTER 1: Introduction

Due to the pervasive nature of software, software-engineering practitioners have continuously expressed their concerns over their inability to accurately predict the cost, schedule and quality of a software product under development. Thus, one of the most important objectives of the software engineering community has been to develop useful models that constructively explain the software development life-cycle and accurately predict the cost, schedule and quality of developing a software product. To that end, many parametric software estimation models have evolved over the last three decades based on pioneering efforts such as Putnam and Quantitative Software Measurement's SLIM model, Jones and Software Productivity Research's Checkpoint model, Park and PRICE Systems' PRICE-S model, Jensen and the SEER-SEM model, Rubin and the Estimacs model, and Boehm and the COCOMO model [Putnam, 1992, Jones, 1997, Park, 1988, Jensen, 1983, Rubin, 1983, Boehm, 1981, Boehm, 1995, Walkerden, 1997, Conte, 1986, Fenton, 1991, Masters, 1985, Mohanty, 1981].

Almost all of the above-mentioned parametric models have been empirically calibrated to actual data from completed software projects. The most commonly used technique for empirical calibration has been the popular classical multiple regression approach. This approach imposes a few restrictions that are often violated by software engineering data, and it has resulted in the development of inaccurate empirical models that do not perform very well when used for prediction [Briand, 1992, Fuller, 1987, Judge, 1985, Judge, 1993, Mullet, 1976, Weisberg, 1985].

The focus of this dissertation is to explain the drawbacks of the multiple regression approach for software engineering data and to discuss the Bayesian approach, which alleviates a few of the problems faced by the multiple regression approach. Bayesian analysis is a well-defined and rigorous process of inductive reasoning that has been used in many scientific disciplines [the reader can refer to Gelman, 1995, Zellner, 1983, Box, 1973 for a broader understanding of the Bayesian analysis approach]. A distinctive feature of the Bayesian approach is that it permits the investigator to use both sample (data) and prior (expert-judgement and/or older data) information in a logically consistent manner in making inferences. This is done by using Bayes theorem to produce a post-data or posterior distribution for the model parameters. Using Bayes theorem, prior (or initial) values are transformed to post-data views. This transformation can be viewed as a learning process. The posterior distribution is determined by the variances of the prior and sample information. If the variance of the prior information is smaller than the variance of the sampling information, then a higher weight is assigned to the prior information. On the other hand, if the variance of the sample information is smaller than the variance of the prior information, then a higher weight is assigned to the sample information, causing the posterior estimate to be closer to the sample information.

The Bayesian approach discussed in this dissertation enables stronger solutions to one of the biggest problems faced by the software engineering community: the challenge of making good decisions using data that is usually scarce and incomplete. The Bayesian approach yields higher prediction accuracy for COCOMO II than the multiple regression approach when validated on a dataset of 161 datapoints. A seven-step modeling methodology that shows the step-by-step process of developing an empirical software-engineering model using the Bayesian approach is presented. An expert-judgement calibrated quality model extension to COCOMO II, namely COQUALMO, that can be used to do cost/schedule/quality tradeoffs is also described.

The important question answered in this dissertation is: Is the Bayesian approach better than the multiple regression approach for the software engineering data used for the COCOMO II calibration? The results in this dissertation show that the predictive performance of the Bayesian approach on our latest sample of 161 datapoints is significantly better than that of the multiple regression approach. The Bayesian-calibrated model using 83 projects yields PRED(.30) of 66% when validated against 161 projects (i.e. 106/161 projects, or 66%, are estimated within 30% of the actuals), whereas the multiple-regression approach yields PRED(.30) of only 44% on the same data.
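As a concrete illustration of the PRED(.30) metric used above, the following minimal Python sketch counts the fraction of projects whose estimate falls within 30% of the actual effort. The actual and estimated values are invented for demonstration and are not the COCOMO II dataset.

    # Illustrative sketch of the PRED(.30) accuracy metric; the project values
    # below are invented for demonstration and are not the COCOMO II data.

    def pred(actuals, estimates, level=0.30):
        """Fraction of estimates within `level` (e.g. 30%) of the actuals."""
        hits = sum(1 for a, e in zip(actuals, estimates)
                   if abs(e - a) / a <= level)
        return hits / len(actuals)

    actual_effort = [100, 250, 80, 400, 60]      # person-months (hypothetical)
    estimated_effort = [120, 210, 95, 640, 58]   # model outputs (hypothetical)

    print(f"PRED(.30) = {pred(actual_effort, estimated_effort):.0%}")
    # 4 of the 5 hypothetical projects fall within 30% of the actuals -> 80%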

CHAPTER 2: Existing Software Estimation Models and Techniques

2.1 Introduction

Budgeting, project planning and control, tradeoff and risk analyses, and software return on investment analyses are among the many uses of software engineering cost, schedule and quality models and estimation. In this chapter, the leading techniques and their relative strengths for these purposes are summarized.

Significant research on software cost modeling began with the extensive 1965 SDC study of the 104 attributes of 169 software projects [Nelson, 1966]. This led to the development of some useful partial models in the late 1960s and early 1970s. In the late 1970s, more robust models such as SLIM [Putnam, 1992], Checkpoint [Jones, 1997], PRICE-S [Park, 1988], SEER [Jensen, 1983], and COCOMO [Boehm, 1981] were developed. Although most of these models were developed at about the same time, they all faced the same dilemma: as software grew in size and importance it also grew in complexity, making it very difficult to accurately predict the cost, schedule and/or the quality of a software product. This dynamic field of software estimation sustained the interests of these researchers, who succeeded in setting the stepping stones of software engineering cost models.

The most commonly used techniques for these models include classical multiple regression approaches. However, these classical model-building techniques are not necessarily the best when used on software engineering data. Beyond regression, several papers discuss the pros and cons of one software cost estimation technique versus another and present analysis results [Briand, 1992, Khoshgoftaar, 1995]. In contrast, this chapter focuses on the classification of existing techniques into six major categories as shown in figure 2.1, providing an overview with examples of each category. In section 2.2, the first category is discussed in more depth with a comparison of the most popular cost models. Sections 2.3 through 2.7 discuss the remaining techniques, and section 2.8 presents some conclusions on the six techniques.

Figure 2.1: Software Estimation Techniques Classification. [Diagram: the six categories of software estimation techniques with examples - Model-Based (e.g. SLIM, COCOMO, Checkpoint, SEER, reliability growth models), Expertise-Based (e.g. Delphi, rule-based), Learning-Oriented (e.g. neural, case-based), Dynamics-Based (e.g. Abdel-Hamid-Madnick), Regression-Based (e.g. OLS, robust), and Composite (e.g. Bayesian COCOMO II).]

2.2 Model-Based Techniques

As discussed above, quite a few software estimation models have been developed in the last few decades. Many of them are proprietary models and hence cannot be compared and contrasted in terms of the model structure. Theory or experimentation determines the functional form of these models. This section discusses a few of the popular models and, wherever appropriate, a discussion of the corresponding quality management model is also included. Section 2.2.10 gives a brief description of reliability growth models, which also fit into the category of model-based techniques. Even though reliability growth models are not useful for in-process quality management and do not cover the early development activities, many such models exist and can be exploited for quality management at the back end of the development process.

2.2.1 The Putnam Software Life-cycle Model (SLIM)

Larry Putnam of Quantitative Software Measurement developed the Software Life-cycle Model (SLIM) in the late 1970s [Putnam, 1992]. SLIM is based on Putnam's analysis of the life-cycle in terms of a Rayleigh distribution of project personnel level versus time. It supports most of the popular size estimating methods, including ballpark techniques, source instructions, function points, etc. It makes use of a Rayleigh curve to estimate project effort, schedule and defect rate. A Manpower Buildup Index (MBI) and a Technology Constant or Productivity Factor (PF) are used to influence the shape of the curve. SLIM can record and analyze data from previously completed projects, which is then used to calibrate the model; or, if data is not available, a set of questions can be answered to get values of MBI and PF from the existing database.

Figure 2.2: The Rayleigh Model. [Plot of the Rayleigh staffing curve dy/dt = 2Kat e^{-at^2} versus time, with the difficulty D shown as the slope at t = 0 and the peak at the development time t_d; drawn for K = 1.0, a = 0.02, t_d = 0.18.]

In SLIM, Productivity is used to link the basic Rayleigh manpower distribution model to the software development characteristics of size and technology factors.

Productivity, P, is the ratio of software product size, S, and development effort, E. That is,

    P = S / E    (Eq. 2.1)

The Rayleigh curve used to define the distribution of effort is modeled by the differential equation

    dy/dt = 2Kat e^{-at^2}    (Eq. 2.2)

An example is given in figure 2.2, where K = 1.0, a = 0.02 and t_d = 0.18. Different values of K, a and t_d will give different sizes and shapes of the Rayleigh curve. Putnam assumes that the peak staffing level in the Rayleigh curve corresponds to development time (t_d), and E was found to be approximately 40% of K, where K is the total life-cycle effort. From data analysis, Putnam found that when productivity was high the initial staff buildup was lower than for projects with lower productivity. Putnam associated initial staff buildup with the difficulty, D, of the project. In the Rayleigh curve shown above, D, the slope of the curve at t = 0, is defined as:

    D = K / t_d^2    (Eq. 2.3)

The relationship between difficulty, D, and productivity, P, is:

    P = alpha D^{-2/3}    (Eq. 2.4)

But,

    P = S / E  and  E = 0.4K    (Eq. 2.5)

So substituting equation 2.3 into equation 2.4, then substituting the result along with equation 2.5 into equation 2.1, gives:

    S / (0.4K) = alpha (K / t_d^2)^{-2/3}    (Eq. 2.6)

Solving equation 2.6 for S yields:

    S = 0.4 alpha K^{1/3} t_d^{4/3}    (Eq. 2.7)

Thus, the cube root of total life-cycle effort K can be formulated as:

    K^{1/3} = S / (0.4 alpha t_d^{4/3})    (Eq. 2.8)

Putnam suggests that 0.4 alpha in equation 2.8 is a technology factor, C, which accounts for differences among projects; based on his study he has twenty different values for C ranging from 754 to 3,524,578. However, one's value of C may depend more on people and complexity factors than technology factors. Substituting C for 0.4 alpha gives

    K^{1/3} = S / (C t_d^{4/3})    (Eq. 2.9)

Substituting equation 2.9 into equation 2.5 for development effort E, we finally get

    E = 0.4 S^3 / (C^3 t_d^4)    (Eq. 2.10)

Some Rayleigh curve assumptions do not always hold in practice (e.g. flat staffing curves for incremental development; less-than-t_d^4 effort savings for long schedule stretchouts). Putnam has developed several model adjustments for these situations.
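To make the derivation above concrete, the following minimal Python sketch evaluates equations 2.2, 2.9 and 2.10. The size, technology constant C and schedule t_d are arbitrary illustrative values, not calibrated SLIM parameters.

    # Minimal sketch of the SLIM relationships derived above (Eqs. 2.2, 2.9, 2.10).
    # The size, technology constant C and schedule t_d are illustrative only.

    import math

    def slim_lifecycle_effort(size, C, t_d):
        """Total life-cycle effort K from S = C * K**(1/3) * t_d**(4/3)  (Eq. 2.9)."""
        return (size / (C * t_d ** (4.0 / 3.0))) ** 3

    def slim_dev_effort(size, C, t_d):
        """Development effort E = 0.4 * S**3 / (C**3 * t_d**4)  (Eq. 2.10)."""
        return 0.4 * size ** 3 / (C ** 3 * t_d ** 4)

    def rayleigh_staffing(t, K, a):
        """Rayleigh staffing rate dy/dt = 2*K*a*t*exp(-a*t**2)  (Eq. 2.2)."""
        return 2.0 * K * a * t * math.exp(-a * t ** 2)

    S, C, t_d = 50_000, 5000, 2.0          # SLOC, technology constant, years
    K = slim_lifecycle_effort(S, C, t_d)
    E = slim_dev_effort(S, C, t_d)
    print(f"Life-cycle effort K = {K:.1f}, development effort E = {E:.1f}")
    # E works out to 40% of K by construction; staffing peaks near t_d on the
    # Rayleigh curve computed by rayleigh_staffing().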

For the Quality model in SLIM, Putnam assumed the Rayleigh Defect Rate curve based on Trachtenburg's 1982 study at RCA of more than 25 software reliability models [Trachtenburg, 1982]. The Rayleigh equation for the error curve is modeled as

    E_m = (6 E_r / t_d^2) t exp(-3 t^2 / t_d^2)    (Eq. 2.11)

where:
    E_r = total number of errors expected over the life of the project
    E_m = errors per month
    t = instantaneous elapsed time throughout the life cycle
    t_d = elapsed time at milestone 7, the 95% reliability level

Figure 2.3: The Trachtenburg Reliability Rayleigh Curve. [Plot of the defect (error) rate over time (t), rising to a peak and tailing off in a Rayleigh shape.]

Putnam approximated the error curve shown in figure 2.3. Most of the available defect data was in aggregate form; it represented activities from system integration test to product delivery (i.e. the end of development). Putnam integrated the area under the Rayleigh curve and realized that these activities comprised 17% of the total area. He used this result to compute the total number of defects under the curve. Thus a flaw of Putnam's Defect model (like many of the Reliability models that Trachtenburg studied) is that it uses data known only after testing begins.

Recently, Quantitative Software Management has developed a set of three tools based on Putnam's SLIM: SLIM-Estimate, SLIM-Control and SLIM-Metrics. SLIM-Estimate is a project planning tool, SLIM-Control is a project tracking and oversight tool, and SLIM-Metrics is a software metrics repository and benchmarking tool. More information on these SLIM tools can be found at Quantitative Software Management's website.

2.2.2 The Jensen Model

The Jensen model [Jensen, 1983] is very similar to the Putnam SLIM model described above. Jensen proposed

    S = C_te (T K)^{1/2}    (Eq. 2.12)

where

    E = (1/T) (S / C_te)^2    (Eq. 2.13)

Jensen's Effective Technology Constant, C_te, is a slight variation of Putnam's technology factor. Jensen's C_te is the product of a basic technology constant and several adjustment factors (similar to the Intermediate COCOMO 81 form of Effort Adjustment Factor [Boehm, 1981]). The adjustment factors account for differences in product, personnel and computer factors among different software products.
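As a small illustration of Jensen's effort relationship (Eq. 2.13), the Python sketch below evaluates it for invented values of size, effective technology constant and schedule; these numbers carry no calibration meaning.

    # Minimal sketch of Jensen's effort relationship (Eq. 2.13); the size,
    # effective technology constant and schedule values are illustrative only.

    def jensen_effort(size, c_te, T):
        """E = (1/T) * (S / C_te)**2   (Eq. 2.13)."""
        return (size / c_te) ** 2 / T

    print(jensen_effort(size=50_000, c_te=4000, T=2.0))   # about 78 effort units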

2.2.3 The Bailey-Basili Model

John Bailey and Vic Basili [Bailey, 1981] attempted to present a model generation process for developing a local resource estimation model. The process consists of three steps:
i. Compute the background equation.
ii. Determine the factors explaining the differences between actual project data and the mean of the estimates derived by the background equation.
iii. Use the model to predict new projects.

The background equation, or the baseline relationship between effort and size, was determined using 18 datapoints from the NASA SEL (Software Engineering Lab) database. It was formulated as:

    Effort (in Man Months) = 5.5 + 0.73 (Size in Delivered KSLOC)^1.16    (Eq. 2.14)

This equation can be used to predict the effort required for an average project. The next step in the process is to determine a set of factors that differentiates one project from another and helps explain the difference between the actual effort and the effort estimated by the background equation. Bailey and Basili identified close to 100 environmental attributes as possible contributors to the variance in the predicted effort. They also noted that using so many attributes with only 18 data points was not feasible and identified techniques of selecting only the most influential ones. They recognized that determining a subset of the attributes could be done by expert intuition, factor analysis or by the use of correlation matrices. They also grouped attributes in a logical way so that each group had either a positive or a negative impact on effort and could be easily explained. They finally settled upon three groups using only 21 of the original attributes. The groups are shown below:

(I) Total Methodology (METH): Tree Charts, Top Down Design, Design Formalisms, Code Reading, Chief Programmer Teams, Formal Test Plans, Unit Development Folders, Formal Training.

(II) Cumulative Complexity (CMPLX): Customer Interface Complexity, Customer-Initiated Design Changes, Application Process Complexity, Program Flow Complexity, Internal Communication Complexity, External Communication Complexity, Data Base Complexity.

(III) Cumulative Experience (CEXP): Programmer Qualifications, Programmer Experience with Machine, Programmer Experience with Language, Programmer Experience with Application, Team Previously Worked Together.

Each of these groups was rated on a scale from 1 to 5, and then SPSS, a popular statistics package, was used to run multiple regression on several combinations of the attributes, such as:

    Effort = (Size)^A * METH
    Effort = (Size)^A * METH * CMPLX
    Effort = (Size)^A * METH * CMPLX * CEXP
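As an illustration of how such candidate forms can be fit, the Python sketch below estimates a slightly generalised variant, Effort = a * Size^b * METH^c, by least squares on the log scale. The handful of project records is invented for illustration and is not the NASA SEL data that Bailey and Basili used.

    # Hedged sketch of fitting a Bailey-Basili-style form by ordinary least
    # squares on the log scale. The project records below are invented; they
    # are not the NASA SEL dataset.

    import numpy as np

    size = np.array([10, 25, 40, 60, 90], dtype=float)       # KSLOC (hypothetical)
    meth = np.array([3, 2, 4, 1, 5], dtype=float)            # methodology rating 1-5
    effort = np.array([30, 95, 120, 290, 260], dtype=float)  # person-months

    # log Effort = log a + b*log Size + c*log METH  ->  linear least squares
    X = np.column_stack([np.ones_like(size), np.log(size), np.log(meth)])
    coef, *_ = np.linalg.lstsq(X, np.log(effort), rcond=None)
    a, b, c = np.exp(coef[0]), coef[1], coef[2]
    print(f"Effort ~ {a:.2f} * Size^{b:.2f} * METH^{c:.2f}")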

Bailey and Basili concluded that none of the model types they investigated was better than the rest. They concluded that as more data becomes available, the model structures should be further investigated and the model with the highest prediction accuracy should be determined. This model can then be used for predicting a new project.

2.2.4 Checkpoint

Checkpoint is a knowledge-based software project estimating tool from Software Productivity Research (SPR), developed from Capers Jones' studies [Jones, 1997]. It has a proprietary database of thousands of software projects and it focuses on four areas that need to be managed to improve software quality and productivity. It uses Function Points (or Feature Points) as its primary input of size. SPR's Summary of Opportunities for software development is shown in figure 2.4. QA stands for Quality Assurance; JAD for Joint Application Development; SDM for Software Development Metrics.

Figure 2.4: SPR's Summary of Opportunities. [Diagram: the four areas to be managed - People Management, Technology, Development Process and Environment - surrounding Software Quality and Productivity, with opportunities such as developing a tool strategy, establishing QA and measurement specialists, increasing project management experience, developing a measurement program, increasing JADs and prototyping, standardizing use of SDM, establishing QA programs, establishing communication programs, improving physical office space and improving partnership with customers.]

It focuses on three main capabilities for supporting the entire software development life-cycle, as discussed briefly at SPR's website and outlined here:

Estimation: Checkpoint predicts effort at four levels of granularity: project, phase, activity, and task. Estimates also include resources, deliverables, defects, costs, and schedules.

Measurement: Checkpoint enables users to capture project metrics to perform benchmark analysis, identify best practices, and develop internal estimation knowledge bases (known as Templates).

Assessment: Checkpoint facilitates the comparison of actual and estimated performance to various industry standards included in the knowledge base. Checkpoint also evaluates the strengths and weaknesses of the software environment. Process improvement recommendations can be modeled to assess the costs and benefits of implementation.

Other relevant tools that SPR has been marketing for a few years include Knowledge Plan and Function Point (FP) Workbench. Knowledge Plan is a tool that enables an initial software project estimate with limited effort and guides the user through the refinement of "What if?" processes. Function Point Workbench is a tool that expedites function point analysis by providing facilities to store, update and analyze individual counts.

2.2.5 PRICE-S

The PRICE-S model was originally developed at RCA for use internally on software projects such as some that were part of the Apollo moon program. It was then released in 1977 as a proprietary model and used for estimating several US DoD, NASA and other government software projects. The model equations were not released in the public domain, although a few of the model's central algorithms were published by Park in [Park, 1988]. The tool continued to become popular and is now marketed by PRICE Systems, which is a privately held company formerly affiliated with Lockheed Martin. As published on PRICE Systems' website, the PRICE-S model consists of three submodels that enable estimating costs and schedules for the development and support of computer systems. These three submodels and their functionalities are outlined below:

The Acquisition Submodel: This submodel forecasts software costs and schedules. The model covers all types of software development, including business systems, communications, command and control, avionics, and space systems. PRICE-S addresses current software issues such as reengineering, code generation, spiral development, rapid development, rapid prototyping, object-oriented development, and software productivity measurement.

The Sizing Submodel: This submodel facilitates estimating the size of the software to be developed. Sizing can be in Source Lines of Code (SLOC), Function Points and/or Predictive Object Points (POPs). POPs are a new way of sizing object-oriented development projects and were introduced in [Minkiewicz, 1998], based on previous work in Object Oriented (OO) metrics done by Chidamber et al. and others [Chidamber, 1994, Henderson, 1996].

The Life-cycle Cost Submodel: This submodel is used for rapid and early costing of the maintenance and support phase for the software. It is used in conjunction with the Acquisition Submodel, which provides the development costs and design parameters.

PRICE Systems continues to update their model to meet new challenges. Recently, they have added Foresight 2.0, the newest version of their software solution for forecasting time, effort and costs for commercial and non-military government software projects.

2.2.6 ESTIMACS

The Estimacs model was developed by Howard Rubin of Hunter College in the early 1980s [Rubin, 1983]. Its main focus is on the development phase of the system life-cycle. ESTIMACS stresses approaching the estimating task in business terms. It also stresses the need to be able to do sensitivity and trade-off analyses early on, not only for the project at hand, but also for how the current project will fold into the long-term mix or portfolio of projects on the developer's plate for up to the next ten years, in terms of staffing/cost estimates and associated risks. Rubin has identified six important dimensions of estimation (effort hours, staff size and deployment, cost, hardware resource requirements, risk, portfolio impact) and a map (see figure 2.5) showing their relationships, all the way from what he calls the gross business specifications through to their impact on the developer's long-term projected portfolio mix. These business specifications, or project factors, drive the estimate dimensions.

The appeal of ESTIMACS lies in its ability to do estimates and sensitivity analyses of hardware as well as software, and in its business-oriented planning approach.

Figure 2.5: Rubin's Map of Relationships of Estimation Dimensions. [Diagram: gross business specifications driving the estimation dimensions - effort hours, staff, cost, hardware resources, elapsed time, risk and portfolio impact.]

2.2.7 SEER-SEM

SEER-SEM is the System Evaluation and Estimation of Resources - Software Estimation Model offered by Galorath, Inc. of El Segundo, California. This proprietary model has been available for nearly fifteen years and is based on the original Jensen model [Jensen, 1983]. The model covers all phases of the software project life-cycle, from early specification through design, development, delivery and maintenance. Figure 2.6, adapted from a Galorath illustration, shows the several categories of model inputs and outputs.

Figure 2.6: SEER-SEM Inputs and Outputs. [Diagram: model inputs - size, personnel, environment, complexity and constraints - feeding SEER-SEM, which produces outputs of effort, cost, schedule, risk, maintenance and reliability.]

Aside from SEER-SEM, Galorath, Inc. offers a suite of many tools addressing hardware as well as software concerns.

2.2.8 SOFTCOST

The original SOFTCOST mathematical model was developed for NASA in 1981 by Dr. Robert Tausworthe of JPL [Tausworthe, 1981]. This model has been enhanced using the research results of Boehm, Doty, Putnam, Walston-Felix and Wolverton. The most debated property of this model is its linear relationship between effort and size. A proprietary set of Softcost-based models (Softcost-R, Softcost-Ada, Softcost-OO) was also developed by Reifer Consultants Inc. [Reifer, 1989, ReiferA, 1991, ReiferB, 1991]. None of these models predicted the quality of the software under development.

2.2.9 COCOMO II

The COCOMO (COnstructive COst MOdel) cost and schedule estimation model was originally published in [Boehm, 1981]. It became one of the most popular parametric cost estimation models of the 1980s. But COCOMO 81, along with its 1987 Ada update, experienced difficulties in estimating the costs of software developed to new life-cycle processes and capabilities. The COCOMO II research effort was started in 1994 at USC to address the issues of non-sequential and rapid development process models, reengineering, reuse-driven approaches, object-oriented approaches, etc. COCOMO II was initially published in the Annals of Software Engineering in 1995 [Boehm, 1995]. The model has three submodels, Applications Composition, Early Design and Post-Architecture, which can be combined in various ways to deal with the current and likely future software practices marketplace.

The Application Composition model is used to estimate effort and schedule on projects that use Integrated Computer Aided Software Engineering tools for rapid application development. These projects are too diversified but sufficiently simple to be rapidly composed from interoperable components. Typical components are GUI builders, database or objects managers, middleware for distributed processing or transaction processing, etc., and domain-specific components such as financial, medical or industrial process control packages. The Applications Composition model is based on Object Points [Banker, 1992, Kauffman, 1993]. Object Points are a count of the screens, reports and 3GL language modules developed in the application. Each count is weighted by a three-level (simple, medium, difficult) complexity factor. This estimating approach is commensurate with the level of information available during the planning stages of Application Composition projects.

The Early Design model involves the exploration of alternative system architectures and concepts of operation. Typically, not enough is known to make a detailed fine-grain estimate. This model is based on function points (or lines of code when available) and a set of five scale factors and seven effort multipliers.

The Post-Architecture model is used when top-level design is complete and detailed information about the project is available; as the name suggests, the software architecture is well defined and established. It estimates for the entire development life-cycle and is a detailed extension of the Early Design model. This model is the closest in structure and formulation to the Intermediate COCOMO 81 and Ada COCOMO models. It uses Source Lines of Code and/or Function Points for the sizing parameter, adjusted for reuse and breakage, a set of 17 effort multipliers, and a set of 5 scale factors that determine the economies/diseconomies of scale of the software under development. The 5 scale factors replace the development modes in the COCOMO 81 model and refine the exponent in the Ada COCOMO model.

The Post-Architecture model has been calibrated to a database of 161 projects collected from Commercial, Aerospace, Government and non-profit organizations using the Bayesian approach discussed further in section 2.7, Composite Techniques. The Early Design model calibration is obtained by aggregating the calibrated Effort Multipliers of the Post-Architecture model as described in [CSE, 1997]. The Scale Factor calibration is the same in both models. Unfortunately, due to lack of data, the Application Composition model has not yet been calibrated beyond an initial calibration to the [Kauffman, 1993] data.

A primary attraction of the COCOMO models is their fully available internal equations and parameter values. Over a dozen commercial COCOMO 81 implementations are available; one (Costar) also supports COCOMO II: for details, see the COCOMO II website.

2.2.10 Software Reliability Growth Models

A common definition of reliability growth (according to [Kan, 1996]) is that the successive inter-failure times tend to become larger as faults are removed. Hence, if inter-failure times are denoted by a sequence of random variables T_1, T_2, ..., then

    T_i <=_st T_j  for all i < j    (Eq. 2.15)

where <=_st means "stochastically smaller than", i.e. for all v > 0, P{T_i < v} >= P{T_j < v}.

The earliest and most referenced model on reliability growth is the Jelinski-Moranda model [Moranda, 1972]. It computes the Mean-Time-To-Failure (MTTF) as

    MTTF = Integral from 0 to infinity of R(t) dt    (Eq. 2.16)

where R(t) is the probability that no errors will occur from time 0 to time t. If F(t) is the failure function, i.e. the probability that a failure will occur from time 0 to time t, then F(t) = 1 - R(t). If f(t) is the probability density of F(t), then

    f(t) = dF(t)/dt = -dR(t)/dt    (Eq. 2.17)
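As a concrete instance of equation 2.16 (an illustration, not an example from the dissertation's data), suppose the hazard rate is a constant lambda, so that R(t) = e^{-lambda t}. Then

    MTTF = Integral from 0 to infinity of e^{-lambda t} dt = 1 / lambda

so a failure rate of, say, lambda = 0.02 failures per hour corresponds to an MTTF of 50 hours.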

If z(t) (popularly known as the hazard function in the software reliability modeling domain) is the conditional probability that an error occurs in the time interval between t and t+Dt, assuming that no error occurred before time t and that the error now occurs at T, then

    z(t) Dt = P{t < T < t+Dt | T > t}    (Eq. 2.18)

which is equivalent to

    z(t) Dt = P{t < T < t+Dt} / P{T > t} = [F(t+Dt) - F(t)] / R(t)    (Eq. 2.19)

The MTTF in equation 2.16 can be estimated by observing the behavior of a software product over a time period and plotting the time between consecutive errors. Hopefully, the reliability growth trend will be observed, i.e. as errors are detected and removed, the time between successive errors increases. This phenomenon can then be extrapolated to estimate the MTTF at any point in time and can also be used to estimate the total number of errors (which is the number of errors from time 0 to time infinity).

The most important factor of this model is identifying the hazard function z(t). The Jelinski-Moranda model assumes that z(t) is constant until an error is discovered and corrected (assuming that no time elapses between error discovery and error removal and that no new errors are introduced when an error is removed). At this time, z(t) is reduced and held constant until another error is corrected. Another assumption made by the model is that z(t) is proportional to the number of residual errors, i.e. z(t) = K(N - i), where K is a proportionality constant, N is the unknown number of initial errors in the software product and i is the number of errors corrected by time t. The hazard function proposed by this model is depicted in figure 2.7.

Figure 2.7: Hazard function proposed by the Jelinski-Moranda Reliability Growth Model. [Plot of z(t) as a step function of t, starting at NK and dropping by K each time an error is corrected; t is either program operating time or calendar time.]

The parameters N and K are computed by using calibration on data from errors that have already been detected. Other hazard functions can also be used to better suit the environment. For example, a triangular hazard function that indicates an initial increase in failure rate followed by the reliability growth function has been proposed. The Jelinski-Moranda model formed the framework for many software reliability models [Littlewood, 1973, Littlewood, 1978, Musa, 1984, Goel, 1979, etc.] that have been developed in the past three decades. Farr conducted a survey in 1983 and presented many of the variations and alternative ways of estimating software reliability [Farr, 1983].
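To make the Jelinski-Moranda assumptions concrete, the following minimal Python sketch evaluates the hazard z(t) = K(N - i) and the expected time to the next failure; the values of N and K are invented for illustration and are not parameters calibrated from real failure data.

    # Minimal sketch of the Jelinski-Moranda hazard and the expected time to the
    # next failure. N (initial faults) and K (proportionality constant) are
    # illustrative values only.

    def hazard(N, K, i):
        """z(t) = K * (N - i): constant between fixes, dropping after each fix."""
        return K * (N - i)

    def expected_next_interfailure_time(N, K, i):
        """With a constant hazard the inter-failure time is exponentially
        distributed, so its expectation is 1 / z(t)."""
        return 1.0 / hazard(N, K, i)

    N, K = 100, 0.005          # hypothetical initial fault count and constant
    for corrected in (0, 50, 90, 99):
        t = expected_next_interfailure_time(N, K, corrected)
        print(f"{corrected:3d} faults fixed -> expected time to next failure {t:8.1f}")
    # The expected inter-failure time grows as faults are removed, which is
    # exactly the reliability growth trend described above.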

2.2.11 Summary of Model-Based Techniques

Overall, model-based techniques are good for budgeting, tradeoff analysis, planning and control, and investment analysis. Most of the existing software estimation models fit under this category, but since they are calibrated to past experience, their primary difficulty is with unprecedented situations.

2.3 Expertise-Based Techniques

Expertise-based techniques are useful in the absence of quantified, empirical data and are based on the prior knowledge of experts in the field [Boehm, 1981]. Based on their experience and understanding of the proposed project, experts arrive at an estimate of the cost/schedule/quality of the software under development. The obvious drawback to this method is that an estimate is only as good as the expert's opinion. The pros and cons of this technique are complementary to those of model-based techniques and are shown in table 2.1.

Table 2.1: Pros and Cons of Expertise-Based Techniques

Pros:
- Easily incorporates knowledge of differences between past project experiences
- Assessment of exceptional circumstances, interactions and representativeness

Cons:
- No better than the experts
- Estimates can be biased
- Subjective estimates that may not be analyzable

Examples of expertise-based techniques include the Delphi technique, rule-based systems and the Work Breakdown Structure, each of which is described in the following subsections.

2.3.1 The Delphi Approach

The Delphi approach originated at The Rand Corporation in 1948, originally as a way of making predictions about future events - thus its name, recalling the divinations of the Greek oracle of antiquity, located on the southern flank of Mt. Parnassos at Delphi. Since then, it has been used as an effective way of getting group consensus. It alleviates the problem of individual biases and results in an improved group consensus estimate. Farquhar performed an experiment at Rand Corporation in 1970 where he gave four groups the same software specification and asked the groups to estimate the effort it took to develop the product [Farquhar, 1970]. Two groups used the Delphi technique and two groups had meetings. The groups that had meetings came up with an extremely accurate estimate compared to the groups that used the Delphi technique. To improve the estimate consensus obtained by the Delphi technique, Boehm and Farquhar formulated an alternative method, the wideband Delphi technique [Boehm, 1981]. The wideband Delphi approach is described in Table 2.2.

Table 2.2: Wideband Delphi Approach
1. Coordinator provides the Delphi instrument to each of the participants to review.
2. Coordinator conducts a group meeting to discuss related issues.
3. Participants complete the Delphi forms anonymously and return them to the Coordinator.
4. Coordinator feeds back the results of the participants' responses.
5. Coordinator conducts another group meeting to discuss variances in the participants' responses to achieve a possible consensus.
6. Coordinator asks participants for re-estimates, again anonymously, and steps 4-6 are repeated as many times as appropriate.

Chulani and Boehm also used the Delphi approach to specify the prior information required for the Bayesian calibration of COCOMO II (see chapter 4, section 4.2 and appendix A). Chulani and Boehm further used the technique to estimate software defect introduction and removal rates during various activities of the software development life-cycle. These factors appear in COQUALMO (COnstructive QUALity MOdel), which predicts the residual defect density in terms of the number of defects/unit of size (see chapter 4, section 4.3 and appendix A).

2.3.2 Rule-Based Systems

This technique has been adopted from the Artificial Intelligence domain, where a known fact fires up rules which in turn may assert new facts. The system can be used for estimation when no further rules are fired up from known (or new) facts. An example of a rule from a rule-based system developed by Madachy is shown below [Madachy, 1997]:

    If Required Software Reliability = Very High
    AND Personnel Capability = Low;
    then Risk Level = High

2.3.3 Work Breakdown Structure (WBS)

This technique of software estimating involves breaking down the product to be developed into smaller and smaller components until the components can be independently estimated. The estimation can be based on analogy from an existing database of completed components, or the components can be estimated by experts, or by using the Delphi technique described above. Once all the components have been estimated, a project-level estimate can be derived by rolling up the estimates (a small sketch of such a roll-up follows the WBS figures below). As discussed in [Boehm, 1981], a software WBS consists of two hierarchies, one representing the software product itself, and the other representing the activities needed to build that product. The product hierarchy (figure 2.8) describes the fundamental structure of the software, showing how the various software components fit into the overall system. The activity hierarchy (figure 2.9) indicates the activities that may be associated with a given software component.

Figure 2.8: A Product Work Breakdown Structure. [Tree diagram: a software application decomposed into Component A, Component B (with Subcomponent B1 and Subcomponent B2) and Component N.]

Figure 2.9: An Activity Work Breakdown Structure. [Tree diagram: development activities decomposed into system engineering, programming (with detailed design and code and unit test) and maintenance.]
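As mentioned above, once the leaf components of a product WBS have been estimated, a project-level figure is obtained by rolling the estimates up the hierarchy. The following minimal Python sketch illustrates this; the component names and person-month values are invented for illustration.

    # Hedged sketch of rolling up component-level estimates in a product WBS to a
    # project-level figure; the components and person-month values are invented.

    wbs = {
        "Component A": 14,                       # person-months (hypothetical)
        "Component B": {"Subcomponent B1": 9,
                        "Subcomponent B2": 6},
        "Component N": 11,
    }

    def roll_up(node):
        """Sum leaf estimates, recursing into sub-dictionaries of the WBS."""
        if isinstance(node, dict):
            return sum(roll_up(child) for child in node.values())
        return node

    print(f"Project-level estimate: {roll_up(wbs)} person-months")   # 40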

2.4 Learning-Oriented Techniques

Learning-oriented techniques use prior and current knowledge to develop a software estimation model. Neural networks and case-based reasoning are examples of learning-oriented techniques.

2.4.1 Neural Networks

In the last decade, there has been significant effort put into the research of developing software estimation models using neural networks. Many researchers [Khoshgoftaar, 1995] realized the deficiencies of multiple-regression methods (see section 2.6) and explored neural networks as an alternative. Neural networks are based on the principle of learning from example; no prior information is specified (unlike the Bayesian approach discussed in section 2.7, where prior information is used). Neural networks are characterized in terms of three entities: the neurons, the interconnection structure and the learning algorithm [Karunanithi, 1992]. Most of the software models developed using neural networks use backpropagation-trained feed-forward networks (see figure 2.10). As discussed in [Gray, 1997], these networks are architected using an appropriate layout of neurons. The network is trained with a series of inputs and the correct output from the training data so as to minimize the prediction error. Once the training is complete, and the appropriate weights for the network arcs have been determined, new inputs can be presented to the network to predict the corresponding estimate of the response variable.

Wittig developed a software estimation model using connectionist models (synonymous with neural networks as referred to in this section) and derived very high prediction accuracies [Wittig, 1994]. Although Wittig's model has accuracies within 10% of the actuals for its training dataset, the model has not been well accepted by the community due to its lack of explanation. Neural networks operate as black boxes and do not provide any information or reasoning about how the outputs are derived. And since software data is not well-behaved, it is hard to know whether the well-known relationships between parameters are satisfied by the neural network or not. For example, the data in the current COCOMO II database says that developing for reuse causes a decrease in the amount of effort it takes to develop the software product. This is in contradiction to both theory and other data sources [Poulin, 1997, Reifer, 1997], which indicate that if you're developing for future reuse, more effort is expended in making the components more independent of other components.

Figure 2.10: A Neural Network Estimation Model. [Diagram: data inputs such as project size, complexity, languages and skill levels feed the estimation algorithms (the network), which produce the model output, an effort estimate; actuals are fed back through a training algorithm to adjust the network.]
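To make the idea of a backpropagation-trained feed-forward estimation network concrete, here is a minimal Python sketch. The tiny training set (normalised size, complexity and experience values mapped to normalised effort), the network size and the learning rate are all invented for illustration and carry no calibration meaning.

    # Hedged sketch of a small backpropagation-trained feed-forward network that
    # maps normalised project attributes to a normalised effort estimate.
    # All data and settings below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # inputs: [size, complexity, team experience], scaled to 0-1 (hypothetical)
    X = np.array([[0.1, 0.2, 0.9], [0.4, 0.5, 0.6], [0.7, 0.8, 0.3], [0.9, 0.9, 0.2]])
    y = np.array([[0.15], [0.45], [0.75], [0.95]])        # normalised effort

    W1 = rng.normal(scale=0.5, size=(3, 4))               # input -> hidden weights
    W2 = rng.normal(scale=0.5, size=(4, 1))               # hidden -> output weights
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):                                 # backpropagation loop
        H = sigmoid(X @ W1)                               # hidden activations
        out = H @ W2                                      # linear output neuron
        err = out - y                                     # prediction error
        grad_W2 = H.T @ err                               # output-layer gradient
        grad_W1 = X.T @ ((err @ W2.T) * H * (1 - H))      # hidden-layer gradient
        W1 -= 0.05 * grad_W1
        W2 -= 0.05 * grad_W2

    print(np.round((sigmoid(X @ W1) @ W2).ravel(), 2))
    # predictions should now be close to the normalised effort targets in y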

Neural networks have been extensively used in the software reliability modeling domain. In particular, [Lyu, 1996] discusses how neural networks can be used for reliability growth modeling and for identifying highly defect-prone modules in the software product. To show the effectiveness of neural networks on reliability growth modeling, Lyu performed an experiment with five well-known reliability growth models using actual software project data. His results showed that neural networks improved the accuracy of the five models considered. Lyu also discusses how neural networks can be used to identify highly fault (defect) prone modules in a software product. Khoshgoftaar was the first to demonstrate this classification of fault-prone software modules [Khoshgoftaar, 1993]. Lyu evaluates neural networks developed using a perceptron network and an adapted cascade-correlation algorithm [Karunanithi, 1992] and classifies faults as type I (low-fault-prone) and type II (high-fault-prone). He found that the perceptron network worked effectively in classifying type I faults and the adapted cascade-correlation algorithm network worked effectively in classifying type II faults. For more details, please read chapter 17, Neural Networks for Software Reliability Engineering, of [Lyu, 1996].

2.4.2 Case-Based Reasoning (CBR)

Case-based reasoning is an enhanced form of estimation by analogy [Boehm, 1981]. A database of completed projects is referenced to relate the actual costs to an estimate of the cost of a similar new project. Thus a sophisticated algorithm needs to exist which compares completed projects to the project that needs to be estimated. After the current project is completed, it must be included in the database to facilitate further usage of the case-based reasoning approach. Case-based reasoning can be done either at the project level or at the sub-system level. Case studies represent an inductive process, whereby estimators and planners try to learn useful general lessons and estimation heuristics by extrapolation from specific examples. Shepperd and Schofield did a study comparing the use of analogy with prediction models based upon stepwise regression analysis for nine datasets (a total of 275 projects), yielding higher accuracies for estimation by analogy. They developed a five-step process for estimation by analogy:
i. identify the data or features to collect
ii. agree on data definitions and collection mechanisms
iii. populate the case base
iv. tune the estimation method
v. estimate the effort for a new project
For further details, the reader is urged to refer to [Shepperd, 1997].

2.5 Dynamics-Based Techniques

Many of the current software cost estimation models lack the ability to estimate the project activity distribution of effort and schedule based on project characteristics. PRICE-S [Frieman, 1979] and Detailed COCOMO [Boehm, 1981] attempted to predict effort with phase-sensitive effort multipliers. Detailed COCOMO provides a set of phase-sensitive effort multipliers for each cost driver attribute. The overall effort estimate using Detailed COCOMO is not significantly higher than the overall effort estimate using the simpler Intermediate COCOMO, but the Detailed COCOMO phase distribution estimates are better.

Forrester pioneered the work on systems dynamics by formulating models using continuous quantities (e.g. levels, rates, etc.) interconnected in loops of information feedback and circular causality [Forrester, 1961, Forrester, 1968]. He referred to his research as simulation methodology. Abdel-Hamid and Stuart Madnick enhanced Forrester's research and developed a model that estimates the time distribution of effort, schedule and residual defect rates as a function of staffing rates, experience mix, training rates, personnel turnover, defect introduction rates, etc. [Abdel-Hamid, 1991]. Since then, systems dynamics has been described as a simulation methodology for modeling continuous systems. Lin and a few others [Lin, 1992] modified the Abdel-Hamid-Madnick model to support process and project management issues. Madachy [Madachy, 1994] developed a dynamic model of an inspection-based software life cycle process to support quantitative evaluation of the process.

The system dynamics approach involves the following concepts [Richardson, 1991]:
- defining problems dynamically, in terms of graphs over time
- striving for an endogenous, behavioral view of the significant dynamics of a system
- thinking of all real systems concepts as continuous quantities interconnected in information feedback loops and circular causality

- identifying independent levels in the system and their inflow and outflow rates
- formulating a model capable of reproducing the dynamic problem of concern by itself
- deriving understandings and applicable policy insights from the resulting model
- implementing changes resulting from model-based understandings and insights.

Mathematically, system dynamics simulation models are represented by a set of first-order differential equations [Madachy, 1994]:

    x'(t) = f(x, p)    (Eq. 2.20)

where:
    x = a vector describing the levels (states) in the model
    p = a set of model parameters
    f = a nonlinear vector function
    t = time

2.6 Regression-Based Techniques

Regression-based techniques are the most popular ways of building models and are used to calibrate almost all of the models described in section 2.2. These techniques include the Standard or "Ordinary Least Squares" regression and the Robust regression approaches.

2.6.1 Standard Regression - Ordinary Least Squares (OLS) Method

Standard regression refers to the classical statistical approach of general linear regression modeling using least squares. It is based on the Ordinary Least Squares (OLS) method discussed in many books such as [Judge, 1985, Judge, 1993, Weisberg, 1985].

The reasons for its popularity include ease of use and simplicity. It is available as an option in several commercial statistical packages such as Minitab, S-Plus, SPSS, etc. A model using the OLS method can be written as

    y_t = B_1 + B_2 x_t2 + ... + B_k x_tk + e_t    (Eq. 2.21)

where x_t2 ... x_tk are predictor (or regressor) variables for the t-th observation, B_2 ... B_k are response coefficients, B_1 is an intercept parameter and y_t is the response variable for the t-th observation. The error term e_t is a random variable with a probability distribution (typically normal). The OLS method operates by estimating the response coefficients and the intercept parameter by minimizing the least squares error term, the sum of the r_i^2, where r_i is the difference between the observed response and the model-predicted response for the i-th observation. Thus all observations have an equivalent influence on the model equation. Hence, if there is an outlier in the observations, it will have an undesirable impact on the model.

The OLS method is well-suited when:

(i) A lot of data is available. This indicates that there are many degrees of freedom available and the number of observations is many more than the number of variables to be predicted. Collecting data has been one of the biggest challenges in this field due to lack of funding by higher management, co-existence of several development processes, lack of proper interpretation of the process, etc.

(ii) No data items are missing. Data with missing information could be reported when there is limited time and budget for the data collection activity, or due to lack of understanding of the data being reported.

47 (iii) there are no outliers. Extreme cases are very often reported in software engineering data due to misunderstandings or lack of precision in the data collection process, or due to different development processes. (iv) the predictor variables are not correlated. Most of the existing software estimation models have parameters that are correlated to each other. This violates the assumption of the OLS approach. (v) the predictor variables have an easy interpretation when used in the model. This is very difficult to achieve because it is not easy to make valid assumptions about the form of the functional relationships between predictors and their distributions. Each of the above is a challenge in modeling software engineering data sets to develop a robust, easy-to-understand, constructive cost estimation model. A variation of the above method was used to calibrate the 1997 version of COCOMO II. Multiple regression was used to estimate the β coefficients associated with the five scale factors and 17 effort multipliers. Some of the estimates produced by this approach gave counter intuitive results. For example, the data analysis indicated that developing software to be reused in multiple situations was cheaper than developing it to be used in a single situation: hardly a credible predictor for a practical cost estimation model. For the 1997 version of COCOMO II, a pragmatic 10% weighted average approach was used. COCOMO II.1997 ended up with a 0.9 weight for the expert data and a 0.1 weight for the regression data. This gave moderately good results for an interim 35

Please refer to chapter 4, section 4.2.1 for further details on the calibration of COCOMO II.

2.6.2 Robust Regression

Robust regression is an improvement over the standard OLS approach. It alleviates the common problem of outliers in observed software engineering data. Software project data usually have a lot of outliers due to disagreement on the definitions of software metrics, coexistence of several software development processes and the availability of qualitative versus quantitative data.

There are several statistical techniques that fall in the category of robust regression. One of the techniques is based on the Least Median Squares method and is very similar to the OLS method described above. The only difference is that this technique minimizes the median of all the r_i² rather than their sum. Another approach that can be classified as robust regression is a technique that uses only the datapoints lying within two (or three) standard deviations of the mean response variable. This method automatically gets rid of outliers and can be used only when there is a sufficient number of observations, so as not to have a significant impact on the degrees of freedom of the model. Although this technique has the flaw of eliminating outliers without direct reasoning, it is still very useful for developing software estimation models with few regressor variables due to lack of complete project data. Most existing parametric cost models (COCOMO II, SLIM, Checkpoint etc.) use some form of regression-based techniques due to their simplicity and wide acceptance.
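To make the two regression variants in this section concrete, here is a minimal sketch (in Python, on an illustrative dataset that is not part of the COCOMO II database) that fits a log-linear effort model by OLS and then refits it after discarding observations lying more than two standard deviations from the fitted response, the trimming form of robust regression described above.

```python
import numpy as np

# Illustrative project data (not from the COCOMO II database):
# effort in person-months, size in KSLOC.
effort = np.array([6.1, 18.0, 45.0, 120.0, 400.0, 39.0])
size   = np.array([4.5, 12.0, 30.0,  85.0, 260.0,  8.0])   # last project is deliberately off-trend

def ols_fit(x, y):
    """Fit y = b0 + b1*x by ordinary least squares."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef

# OLS on the log-transformed model: ln(Effort) = b0 + b1*ln(Size)
x, y = np.log(size), np.log(effort)
coef_ols, fitted = ols_fit(x, y)

# Robust variant: drop points whose residuals lie beyond two standard
# deviations, then refit (only sensible when enough observations remain).
resid = y - fitted
keep = np.abs(resid - resid.mean()) <= 2 * resid.std()
coef_robust, _ = ols_fit(x[keep], y[keep])

print("OLS   :", coef_ols)
print("Robust:", coef_robust, "points kept:", keep.sum(), "of", len(y))
```

With so few observations the trimming step may keep every point, which is exactly the degrees-of-freedom caveat noted above.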

2.7 Composite Techniques

As discussed above, there are many pros and cons of using each of the existing techniques for cost estimation. Composite techniques incorporate a combination of two or more techniques to formulate the most appropriate functional form for estimation. The Bayesian approach, which combines the expertise-based and model-based techniques, is an example of a composite technique.

2.7.1 The Bayesian Approach

An attractive estimating approach that has been used for the development of the COCOMO II model is Bayesian analysis (see chapter 4, section 4.2.2). Bayesian analysis is a well-defined and rigorous process of inductive reasoning that has been used in many scientific disciplines [Gelman, 1995, Zellner, 1983, Box, 1973]. A distinctive feature of the Bayesian approach is that it permits the investigator to use both sample (data) and prior (expert-judgement) information in a logically consistent manner in making inferences. This is done by using Bayes theorem to produce a post-data or posterior distribution for the model parameters. Using Bayes theorem, prior (or initial) values are transformed to post-data views. This transformation can be viewed as a learning process. The posterior distribution is determined by the variances of the prior and sample information. If the variance of the prior information is smaller than the variance of the sampling information, then a higher weight is assigned to the prior information. On the other hand, if the variance of the sample information is smaller than the variance of the prior information, then a higher weight is assigned to the sample information, causing the posterior estimate to be closer to the sample information.
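The variance-based weighting just described can be seen in miniature in the scalar case, where the posterior mean is simply a precision-weighted average of the prior and sample estimates. The sketch below uses made-up numbers purely for illustration.

```python
# Scalar illustration of Bayesian updating (illustrative numbers only).
# Each source of information is weighted by its precision (1 / variance).
prior_mean, prior_var   = 1.10, 0.04   # expert judgement about a parameter
sample_mean, sample_var = 0.95, 0.01   # estimate from project data

prior_prec, sample_prec = 1.0 / prior_var, 1.0 / sample_var

posterior_mean = (prior_prec * prior_mean + sample_prec * sample_mean) / (prior_prec + sample_prec)
posterior_var  = 1.0 / (prior_prec + sample_prec)

# Because the sample variance is smaller here, the posterior mean (0.98)
# sits closer to the sample mean than to the prior mean.
print(posterior_mean, posterior_var)
```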

The Bayesian approach provides a formal process by which a-priori expert-judgement can be combined with sampling information (data) to produce a robust a-posteriori model. Using Bayes theorem, we can combine our two information sources as follows:

f(β|Y) = f(Y|β) f(β) / f(Y)    Eq. 2.22

where β is the vector of parameters in which we are interested and Y is the vector of sample observations from the joint density function f(β|Y). In equation 2.22, f(β|Y) is the posterior density function for β summarizing all the information about β, f(Y|β) is the sample information and is algebraically equivalent to the likelihood function for β, and f(β) is the prior information summarizing the expert-judgement information about β. Equation 2.22 can be rewritten as:

f(β|Y) ∝ l(β|Y) f(β)    Eq. 2.23

In words, equation 2.23 means Posterior ∝ Sample × Prior, where "∝" is the proportionality symbol. In the Bayesian analysis context, the prior probabilities are the simple unconditional probabilities associated with the sample information, while the posterior probabilities are the conditional probabilities given knowledge of sample and prior information.

The Bayesian approach makes use of prior information that is not part of the sample data by providing an optimal combination of the two sources of information. As described in many books on Bayesian analysis [Leamer, 1978, Box, 1973], the posterior mean, b**, and variance, Var(b**), are defined as:

b** = [X'X/s² + H*]^(-1) × [(X'X/s²) b + H* b*]

and

Var(b**) = [X'X/s² + H*]^(-1)

where X is the matrix of predictor variables, s² is the variance of the residuals for the sample data, b is the estimate of β obtained from the sample data, and H* and b* are the precision (inverse of variance) and mean of the prior information respectively.

The Bayesian approach has been used for reliability growth models, where an example prior subjective view is that if no failures occur while the software is being observed, the reliability of the software should increase. This is unlike what happens in the reliability growth models discussed in section 2.2.10, which allow a change in the predicted reliability only when an error occurs. The Littlewood-Verrall model [Littlewood, 1973, Littlewood, 1989] is a very good example of a reliability growth model based on the Bayesian approach. It models the fault generation process during the fault correction process (i.e. removing a fault may actually introduce new faults) by allowing for the probability that the software program could become less reliable than before.

Although Littlewood et al. have used the Bayesian approach for their reliability growth models, very little research has been done to prove that the resulting model is significantly better than models developed using commonly used regression techniques. Moreover, reliability growth models are usually used for reliability assessment or for quality management at the back end of the development process, not for in-process quality management. The Bayesian approach had not been used (until the work presented in this dissertation) for modeling the cost, schedule or quality of a software product during the entire development process, i.e. during elaboration and construction of the software product.

The Bayesian approach described above is the focus of this dissertation and has been used in the most recent calibration of COCOMO II on a database currently consisting of 161 project data points. The a-posteriori COCOMO II.1999 calibration gives predictions that are within 30% of the actuals 75% of the time on 161 datapoints, which is a significant improvement over the COCOMO II.1997 calibration, which gave predictions within 30% of the actuals 52% of the time on 83 datapoints, as shown in table 2.3. Note that the 1997 calibration was not performed using Bayesian analysis; rather, a 10% weighted linear combination of expert prior vs. sample information was applied (see chapter 4, section 4.2.1 for further details on the 1997 calibration). If the model's multiplicative coefficient is calibrated to each of the major sources of project data, i.e., stratified by data source, the resulting model produces estimates within 30% of the actuals 80% of the time. It is therefore recommended that organizations using the model calibrate it using their own data to increase model accuracy and produce a local optimum estimate for similar types of projects.

From table 2.3 it is clear that the prediction accuracy of the COCOMO II.1999 Bayesian model on 161 datapoints is better than the prediction accuracy of the COCOMO II.1997 weighted linear model on 83 datapoints. In chapter 4, further analysis on the same dataset is done to strengthen the conclusion that the Bayesian approach yields better accuracies than the 10% weighted-average multiple regression approach.

Table 2.3: Prediction Accuracy of COCOMO II.1997 vs. COCOMO II.1999

                                      Before Stratification   After Stratification
1997 (83 datapoints)    PRED(.20)            46%                     49%
                        PRED(.25)            49%                     55%
                        PRED(.30)            52%                     64%
1999 (161 datapoints)   PRED(.20)            63%                     70%
                        PRED(.25)            68%                     76%
                        PRED(.30)            75%                     80%

Bayesian analysis has all the advantages of standard regression and it includes prior knowledge of experts. It attempts to reduce the risks associated with imperfect data gathering. Software engineering data is usually scarce and incomplete, and estimators are faced with the challenge of making good decisions using this data. Classical statistical techniques described earlier derive conclusions based on the available data, but to make the best decision it is imperative that, in addition to the available sample data, we incorporate relevant nonsample or prior information. Usually a lot of good expert-judgment based information on software processes and the impact of several parameters on effort, cost, schedule, quality etc. is available. This information doesn't necessarily get derived from statistical investigation, and hence classical statistical techniques such as OLS do not incorporate it into the decision making process. Bayesian techniques make best use of relevant prior information along with collected sample data in the decision making process to develop a stronger model.

2.8 Conclusions on Existing Software Estimation Techniques

This chapter has presented an overview of a variety of software estimation techniques, classifying them into six broad categories: model-based, expertise-based, learning-oriented, dynamics-based, regression-based and composite techniques. The pros and cons of each of these techniques have been discussed, suggesting the situations in which one technique might be more appropriate to use than another. The chapter has also provided an overview of several popular estimation models currently available. The main conclusion we can draw from this chapter is that the key to arriving at sound estimates is to use a variety of methods and tools and then to investigate the reasons why the estimates provided by one might differ significantly from those provided by another.

CHAPTER 3: The Research Approach and Framework

3.1 Introduction

This chapter describes the framework that is used for the purposes of our research. A simple seven-step modeling methodology that has been successfully used to develop COCOMO II and its quality model extension is described in section 3.2. Section 3.3 focuses on the Bayesian approach and discusses the use of the modeling methodology on a simple bivariate variation of COCOMO II. In section 3.4, the framework for the quality model extension (namely, COQUALMO - COnstructive QUALity MOdel) of COCOMO II is presented.

3.2 The Modeling Methodology

This section outlines the 7-step process (shown in figure 3.1) used to develop COCOMO II, COQUALMO [Chulani, 1997A] and other extensions of COCOMO II, namely CORADMO (COnstructive Rapid Application Development MOdel) and COCOTS (COnstructive COmmercial Off The Shelf integration model); more information on COCOMO II and its extensions can be found on the COCOMO II project web site. This process (or methodology) can be used to develop other software estimation models.

Step 1) Analyze literature for factors affecting the quantities to be estimated

The first step in developing a software estimation model is determining the factors (or predictor variables) that affect the software attribute being estimated (i.e. the response variable). This can be done by reviewing existing literature and analyzing the influence of parameters on the response variable.

For the COCOMO II Post Architecture model, the twenty-two parameters were determined based on usage of the COCOMO 81 model and on the experience of a group of senior software cost analysts. For COQUALMO, the COCOMO II Post Architecture model parameters were used as a starting point to model the Defect Introduction Model (see table 3.1). Please read chapter 4, section 4.3 for further details on COQUALMO.

Figure 3.1: The Seven-Step Modeling Methodology (Step 1: analyze existing literature; Step 2: perform behavioral analyses; Step 3: identify relative significance; Step 4: perform expert-judgment Delphi assessment and formulate the a-priori model; Step 5: gather project data; Step 6: determine the Bayesian a-posteriori model; Step 7: gather more data and refine the model)

Table 3.1: Step 1 - Factors Affecting Cost and Quality (COCOMO II and Cost/Quality Model Drivers)

Product:   Required Software Reliability (RELY), Data Base Size (DATA), Required Reusability (RUSE), Documentation Match to Life-Cycle Needs (DOCU), Product Complexity (CPLX)
Platform:  Execution Time Constraint (TIME), Main Storage Constraint (STOR), Platform Volatility (PVOL)
Personnel: Analyst Capability (ACAP), Programmer Capability (PCAP), Personnel Continuity (PCON), Applications Experience (AEXP), Platform Experience (PEXP), Language and Tool Experience (LTEX)
Project:   Use of Software Tools (TOOL), Multisite Development (SITE), Required Development Schedule (SCED), Precedentedness (PREC), Development Flexibility (FLEX), Architecture/Risk Resolution (RESL), Team Cohesion (TEAM), Process Maturity (PMAT)

Step 2) Perform behavioral analyses to determine the effect of factor levels on the quantities to be estimated

Once the parameters have been determined, a behavioral analysis should be carried out to understand the effects of each of the parameters on the response variable. For the COCOMO II Post Architecture model, the effects of each of the 22 COCOMO II factors on productivity were analyzed qualitatively. For COQUALMO, the effects of each of the parameters on defect introduction and removal rates by activity were analyzed. One of the factors, Development Flexibility (FLEX), was found to have no impact on defect introduction, although it was still included in the Delphi analyses to validate the authors' findings.

Step 3) Identify the relative significance of the factors on the quantities to be estimated

After a thorough study of the behavioral analyses is done, the relative significance of each of the predictor variables on the response variable must be defined. For COCOMO II and COQUALMO, the relative significance of each cost driver on productivity and each quality driver on quality was determined.

Step 4) Perform expert-judgment Delphi assessment of quantitative relationships; formulate a-priori version of the model

Once step 3 of the modeling methodology is completed, an assessment of the quantitative relationships of the significance of each parameter must be performed. An initial version of the model can then be defined. This version is based on expert-judgment and is not calibrated against actual project data, but it serves as a good starting point as it reflects the knowledge and experience of experts in the field. For COCOMO II and COQUALMO, a 2-round Delphi process (see appendix A) was performed to assess the quantitative relationships (derived in step 3) and their potential ranges of variability, and to refine the factor level definitions. The driver FLEX was dropped from COQUALMO based on the results of the Delphi showing its insignificance on defect introduction.

Step 5) Gather project data and determine statistical significance of the various parameters

After the initial version of the model is defined, project data needs to be collected to obtain data-determined model parameters. Actual data on Effort, Schedule, Defect Introduction Rates, Defect Removal Fractions and the model parameters is being collected to continuously enhance the existing database and improve the calibration of the model.

Step 6) Determine a Bayesian a-posteriori set of model parameters

Using the expert-determined a-priori values, determine a Bayesian a-posteriori set of model parameters as a weighted average of the a-priori values and the data-determined values, with the weights determined by the statistical significance of the data-based results (see sections 2.7.1, 3.3 and 4.2.2).

Step 7) Gather more data to refine the model

Continue to gather data, and refine the model to be increasingly data-determined vs. expert-determined.

3.3 The Bayesian Approach

In chapter 2, the estimation techniques were classified into six categories. One of them was the composite category, in which the Bayesian approach was given as an example. In this section, a simple bivariate cost model is discussed using the Bayesian framework.

3.3.1 A Simple Software Cost Estimation Model

Software engineering data is usually scarce and incomplete, and we are faced with the challenge of making good decisions using this data. Classical statistical techniques described in section 2.6 derive conclusions based on the available data.

But, to make the best decision, it is imperative that in addition to the available sample data we incorporate nonsample or prior information that is relevant. Usually a lot of good expert-judgment based information on software processes and the impact of several parameters on effort, cost, schedule, quality etc. is available. This information doesn't necessarily get derived from statistical investigation, and hence classical statistical techniques such as OLS do not incorporate it into the decision making process. The question that we need to answer is: How do we make the best use of relevant prior information in the decision making process? The Bayesian approach is one way of systematically employing sample and nonsample data effectively to derive a cost estimation model.

Basic Framework: Terminology and Theory

The two main questions that we want to answer using the Bayesian framework are: (i) How do we make reasonable conclusions about a parameter before and after a sample is taken? (ii) How do we statistically combine sample data with prior information?

Let us consider a simple economic model for software cost estimation:

Effort = A × Size^B × ε    Eq. 3.1

where Effort is the number of man-months (MM) required to develop a software product of size measured in source lines of code (SLOC), and ε is the log-normal error term. The cost of developing the product is estimated by taking the product of effort and labor rate.

Rewriting this in linear form, we take logs, which yields

ln(Effort) = ln A + B ln(Size) + ln(ε)    Eq. 3.2

i.e.

ln(Effort) = A_1 + B ln(Size) + ε_1    Eq. 3.3

where A_1 = ln A and ε_1 = ln(ε).

For all samples t and s, we assume Covariance(e_t, e_s) = 0, where e_t and e_s are the errors associated with observations t and s respectively, and Covariance(e_t, e_s) is the measure of the degree to which e_t and e_s are linearly related. Since the covariance is assumed to be zero, each sample is assumed to be independent of every other sample. To model the above equation, we need to derive the values of B and A_1. To understand how to incorporate prior information along with the collected sample data, we must thoroughly understand the modeling concepts in the absence of prior information. The next section illustrates this scenario using a very simple software cost estimation model.

Modeling under complete prior uncertainty

Consider the hypothetical dataset shown below:

Effort (PM)    Size (SLOC)
6.1            4500
...            ...

Figure 3.2: A Simple Software Cost Model

For example, the first observation in the dataset is a software product of size 4500 SLOC (Source Lines of Code) that took 6.1 PM (Person Months, where 1 PM = 152 hours) to develop. Now, let us suppose that we have no prior information about the distributions of A_1 and B, i.e. we are completely uncertain about the values of A_1 and B. We believe that both A_1 and B can lie anywhere between -∞ and +∞. To represent complete ignorance of the probability density of A_1 and B, we write the prior density functions as

f(A_1) = 1 for -∞ < A_1 < +∞  and  f(B) = 1 for -∞ < B < +∞    Eq. 3.4

Figure 3.3: Prior Density Functions (f(A_1) = 1 over the whole real line; f(B) = 1 over the whole real line)

Linear regression using the above sample data gives the following results (RCode output):

Data set = Hypothetical, Response = log[Effort], Predictor = log[Size]
Number of cases: 5, Degrees of freedom: 3
The constant is estimated as 0.33 with a standard error of 0.12, and the coefficient of log[Size] as 0.99 with a standard error of 0.19; R Squared, Sigma hat and the analysis-of-variance table are also reported.

This derives an economic model of software estimation that can be formulated as:

ln(Effort) = 0.33 + 0.99 ln(Size)    Eq. 3.5

or Effort = 1.4 × Size^0.99, where 1.4 = e^0.33.

The estimate of 0.33 cannot be used as a reliable estimate, as we do not have data in the region where ln(Size) = 0, i.e. Size = 1000 SLOC or 1 KSLOC. But let us nevertheless explore its interpretation. The above point estimates for A_1 and B can be used to construct interval estimates using their standard errors. Using the t-distribution, the appropriate critical value, t_c, for three degrees of freedom and a 95% confidence interval is 3.182, giving

0.99 - (3.182)(0.19) < B < 0.99 + (3.182)(0.19), i.e. 0.39 < B < 1.58    Eq. 3.6

The interval suggests that the exponent for Size could be as small as 0.39 or as large as 1.58.
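The regression step above is easy to reproduce in code. Only the first observation (4.5 KSLOC, 6.1 PM) is given in the text, so the remaining four data points in the sketch below are hypothetical placeholders and the resulting estimates will not exactly match the 0.33 and 0.99 reported above.

```python
import numpy as np

# Hypothetical 5-project dataset: only the first point (4.5 KSLOC, 6.1 PM)
# comes from the text; the other four are placeholders.
size_ksloc = np.array([4.5, 10.0, 25.0, 60.0, 150.0])
effort_pm  = np.array([6.1, 13.0, 33.0, 80.0, 190.0])

# Fit ln(Effort) = A1 + B * ln(Size) by least squares.
X = np.column_stack([np.ones(len(size_ksloc)), np.log(size_ksloc)])
y = np.log(effort_pm)
(A1, B), *_ = np.linalg.lstsq(X, y, rcond=None)

A = np.exp(A1)   # multiplicative constant, so Effort = A * Size**B
print(f"A1 = {A1:.2f}, B = {B:.2f}, A = {A:.2f}")
```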

A lot of studies in the software estimation domain have shown that software exhibits diseconomies of scale [Banker, 1994, Gulledge, 1993]. In the simple model presented above, the exponential factor accounts for the relative economies or diseconomies of scale encountered in different size software projects; the exponent B is used to capture these effects.

If B < 1.0, the project exhibits economies of scale. If the product's size is doubled, the project effort is less than doubled: the project's productivity increases as the product size is increased. Some project economies of scale can be achieved via project-specific tools (e.g., simulations, testbeds) but in general these are difficult to achieve. For small projects, fixed start-up costs such as tool tailoring and setup of standards and administrative reports are often a source of economies of scale.

If B = 1.0, the economies and diseconomies of scale are in balance. This linear model is rarely found in the software economics literature.

If B > 1.0, the project exhibits diseconomies of scale. This is generally due to two main factors: growth of interpersonal communications overhead and growth of large-system integration overhead. Larger projects will have more personnel, and thus more interpersonal communications paths consuming overhead. Integrating a small product as part of a larger product requires not only the effort to develop the small product, but also the additional overhead effort to design, maintain, integrate, and test its interfaces with the remainder of the product.

The data analysis on the original COCOMO indicated that its projects exhibited net diseconomies of scale. The projects factored into three classes or modes of software development (Organic, Semidetached, and Embedded), whose exponents B were 1.05, 1.12, and 1.20, respectively.

The data analysis on the most recent COCOMO II.1999 database indicates that some projects exhibit economies of scale, although there are very few such projects in the database.

Based on the above observations, the model derived from linear regression is unsatisfactory, especially since much of the 95% confidence region for B lies between 0.39 and 1.0, i.e. in the region where B < 1. This could be due to two reasons: (i) the estimate is not accurate or reliable; (ii) there is sampling error and B is indeed > 1, or maybe > 0.95 or some value closer to 1, rather than a value as low as 0.39.

We can also determine the 95% confidence region for A_1 as

0.33 - (3.182)(0.12) < A_1 < 0.33 + (3.182)(0.12), i.e. -0.05 < A_1 < 0.71    Eq. 3.7

It should be noted, though, that the estimate and range of A_1 are an approximation due to the lack of data in the region where ln(Size) = 0. A_1 is only used to help determine the position of the line determined by the model. It is an important parameter for estimation but should not be analyzed to give any economic interpretation.

Figure 3.4: Post-Sample Density Functions: Modeling under complete prior uncertainty (densities of A_1 and B given ln(Effort))

Taking antilogs, we get the range of A as

0.95 < A < 2.03    Eq. 3.8

Summarizing the above, our simple post-sample software cost estimation model (in the absence of prior information) looks like

Effort = 1.4 × Size^0.99, where 0.39 < B < 1.58 and 0.95 < A < 2.03    Eq. 3.9

Equation 3.9 violates our belief that software exhibits diseconomies of scale most of the time. This disbelief is an indication that there was some prior information, namely the belief of diseconomies of scale for software (i.e. B > 1.0, or at least B > 0.95 or B > 0.9), which was not specified, implying that prior to sampling we were not completely uncertain of the value of B.

Modeling with the Inclusion of Prior Information

As described above, we are not completely uncertain about the probability distribution of B. We know that all values of B in the range -∞ < B < +∞ are not equally likely. In fact, we know much more than that. The uniform prior density functions used above are incomplete. We need a way of incorporating our current nonsample knowledge into our prior density functions so that the resulting model is more indicative of our experience. Let us assume for the purposes of this example that B > 1 and answer the following questions: (i) How do we include our prior information of B in terms of an a-priori density function? (ii) How do we combine the nonsample prior information with our observed data? (iii) How do we determine estimates for the combined information?

The first question we need to answer is: How do we include our prior information of B in terms of an a-priori density function?

If we believe that B > 1, but we do not know where exactly B lies, then all values of B > 1 are equally likely. We need a probability function that appropriately models that all values of B > 1 are equally likely. The following function is a reasonable one with this property:

f(B) = 1 if B > 1, and f(B) = 0 if B ≤ 1    Eq. 3.10

The nonsample prior density function of B represented in equation 3.10 is depicted below.

Figure 3.5: Prior density function of B (f(B) = 0 for B ≤ 1 and f(B) = 1 for B > 1)

One can argue that the above prior density function is not very accurate. We know that P(1 < B < 5) is higher than P(5 < B < 10). Other, more specific probability density functions could be used to include this information. However, for now, we will assume that P(1 < B < 5) = P(5 < B < 10).

The next question we need to answer is: How do we combine the nonsample prior information with our observed data?

Modeling under complete prior uncertainty resulted in a point estimate of B = 0.99 with a 95% confidence region of 0.39 < B < 1.58, i.e. P_N(0.39 < B < 1.58) = 0.95. (From now on, the subscript N denotes the Normal distribution obtained when modeling with complete uncertainty, and the subscript TN the Truncated Normal distribution obtained when modeling with prior nonsample information.) The probability that B is less than 1 (the shaded region in figure 3.6) is

P(B < 1) = P(z < (1 - 0.99)/0.39) = P(z < 0.026) = 0.51

Figure 3.6: Post-Sample Density Functions (Modeling with the Inclusion of Prior Information): the normal density f_N(B | ln(Effort)), with 51% of its area below B = 1, and the truncated normal density f_TN(B | ln(Effort))

If our prior information attaches f(B) = 0 for B < 1, then our post-sample model should also include this information. Our next step is to see how we should include f(B) = 0 for B < 1 in the model

Effort = 1.4 × Size^0.99, where 0.39 < B < 1.58 and 0.95 < A < 2.03

We need to truncate the normal post-sample density function shown in the chart above to exclude the part that includes B < 1. This means that we take the probability mass to the left of the curve at B = 1 and distribute it proportionally across the rest of the curve. The resulting probability density function is called the truncated normal distribution. It is depicted in figure 3.6 along with the probability density function of complete uncertainty. The truncated probability density function is the post-sample density function. P(B < 1) = 0 in this post-sample probability density function, and this is consistent with our prior economic principles.

The third question we need to answer is: How do we determine estimates for the combined information?

To determine the point estimate of B, we use a computer-generated sample of 10,000 datapoints drawn from the post-sample normal distribution of B (mean = 0.99). We discard those observations that have B < 1. Of the 10,000 observations, 4,760 have B > 1. The mean of these remaining 4,760 observations is 1.15; the mean and standard deviation of this truncated sample estimate the mean and standard deviation of the post-sample probability density function of B. Thus, the a-posteriori point estimate of B is 1.15. Summarizing, we have:

Total number of observations randomly generated: 10,000
Number of observations with B < 1: 5,240
Number of observations with B > 1: 4,760
P_N(B < 1) = 5,240/10,000 = 0.524

This is very close to the probability computed above, i.e. 0.51. Note that as the sample size grows bigger, P_N(B < 1) approaches 0.51.
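A short script reproduces this rejection-style computation. The generating standard deviation is not stated in the text, so the sketch assumes the 0.19 standard error of B from the earlier regression; under that assumption the retained fraction and the truncated mean come out close to the figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Post-sample (pre-truncation) distribution of B: normal with mean 0.99.
# The generating standard deviation is not given in the text; 0.19 (the
# standard error of B from the regression) is assumed here.
draws = rng.normal(loc=0.99, scale=0.19, size=10_000)

kept = draws[draws > 1.0]            # discard draws violating the prior B > 1

print("P(B < 1)            ≈", 1 - kept.size / draws.size)   # roughly 0.5
print("posterior mean of B ≈", kept.mean())                  # close to 1.15
print("posterior s.d. of B ≈", kept.std())
```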

Thus, the point estimate of B after the prior information and the sampling data information have been combined is 1.15, with its standard deviation estimated from the same truncated sample. As described above, the Bayesian approach can be diagrammatically summarized as shown in figure 3.7.

Figure 3.7: The Bayesian Approach (a-priori information combined with sampling data yields the a-posteriori model)

A bivariate normal distribution model has been presented in this section. For a general software cost estimation model with more than two parameters, the bivariate distribution can be extended to a multivariate distribution. This approach is used for the Bayesian calibration of COCOMO II described further in chapter 4, section 4.2.2.

3.4 The COnstructive QUALity MOdel (COQUALMO) Framework

Cost, schedule and quality are highly correlated factors in software development. They basically form three sides of the same triangle. Beyond a certain point (the "Quality is Free" point), it is difficult to increase the quality without increasing either the cost or schedule or both for the software under development. Similarly, development schedule cannot be drastically compressed without hampering the quality of the software product and/or increasing the cost of development. Software estimation models can (and should) play important roles in facilitating the balance of cost/schedule and quality. Recognizing this important association, an attempt is being made to develop a quality model extension to COCOMO II, namely COQUALMO. The many benefits of cost/quality modeling (and hence COQUALMO) are:

- Resource allocation: The primary, but not the only important, use of software estimation is budgeting for the development life cycle.
- Tradeoff and risk analysis: An important capability is to enable what-if analyses that demonstrate the impact of various defect removal techniques and the effects of personnel, project, product and platform characteristics on software quality. A related capability is to illuminate the cost/schedule/quality trade-offs and sensitivities of software project decisions such as scoping, staffing, tools, reuse, etc.
- Time to Market initiatives: An important additional capability is to provide cost/schedule/quality planning and control by providing breakdowns by component, stage and activity to facilitate Time To Market initiatives.
- Software quality improvement investment analysis: A very important capability is to estimate the costs and defect densities and assess the return on investment of quality initiatives such as use of mature tools, peer reviews and disciplined methods.

This section presents the software defect introduction and removal framework used for the development of COQUALMO.

3.4.1 The Software Defect Introduction and Removal Model

The Quality model is an extension of the existing COCOMO II [Boehm, 1995, CSE, 1997] model. It is based on the Software Defect Introduction and Removal Model described by Barry Boehm in [Boehm, 1981], which is analogous to the tank and pipe model introduced by Capers Jones [Jones, 1975] and illustrated in figure 3.8.

Figure 3.8: The Software Defect Introduction and Removal Model (requirements, design and code defect introduction pipes feeding a tank of residual software defects, which is drained by defect removal pipes)

Figure 3.8 shows that defects conceptually flow into a holding tank through various defect source pipes. These defect source pipes are modeled in COQUALMO as the Software Defect Introduction Model. The figure also depicts that defects are drained off through various defect elimination pipes, which in the context of COQUALMO are modeled as the Software Defect Removal Model. Each of these two sub-models is discussed in further detail in chapter 4, section 4.3.

Kan proposed a very similar approach to defect modeling for each process step. This approach is illustrated in figure 3.9 and is adapted from [Kan, 1996].

Figure 3.9: Defect Injection and Removal of a Process Step [Kan, 1996] (defects existing on step entry plus defects injected during development, less defects detected, repaired and removed, leave undetected defects existing after the step exit)

Kan also defined a defect matrix by cross-classifying defect data in terms of the development phase in which the defects are found (and removed) and the phases in which the defects are injected. We use a similar matrix approach for our data-reporting scheme, as shown later in this section (table 3.2). Researchers from the Software Productivity Consortium have also made use of a similar phase-based modeling approach to capture the effects of defect injection and removal [Gaffney, 1988, Gaffney, 1990].

Remus and Zilles also used a similar concept and developed the Remus and Zilles defect-removal model [Remus, 1979]. The model assumes that an original set of defects (OD) is introduced when the first version of the software is written. OD is dependent on the size of the software product. Other model inputs include MP, the number of major problems found during reviews and inspections, and PTM, the number of defects discovered during testing. The model assumes that all detected defects are removed but could introduce bad fixes. The model requires an estimate of the proportion of bad fixes, ε_i, and the fraction of defects, P_i, detected at each stage. The model predicts the total defects in the product, TD, as

TD = MP × µ / (µ - 1)

where µ = MP/PTM.
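A small numeric sketch of the Remus and Zilles relationship, with purely illustrative counts for MP and PTM:

```python
def remus_zilles_total_defects(mp: float, ptm: float) -> float:
    """Total defects TD = MP * mu / (mu - 1), with mu = MP / PTM.

    mp : major problems found during reviews and inspections
    ptm: defects discovered during testing
    """
    mu = mp / ptm
    if mu <= 1:
        raise ValueError("model requires MP > PTM (mu > 1)")
    return mp * mu / (mu - 1)

# Illustrative values only: 120 review findings, 40 test defects.
print(remus_zilles_total_defects(mp=120, ptm=40))   # -> 180.0
```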

Unfortunately, Remus and Zilles provided no form of evaluation for their model and hence the model didn't gain much popularity.

For the purposes of COQUALMO, defects are classified based on their origin as Requirements Defects (e.g. leaving out a required Cancel option in an Input screen), Design Defects (e.g. an error in the algorithm), and Coding Defects (e.g. looping 9 instead of 10 times). Defects are further classified based on their severity (the severity categories are adapted from an IEEE standard):

Critical: Causes a system crash or unrecoverable data loss or jeopardizes personnel. The product is unusable (and in mission/safety software would prevent the completion of the mission).

High: Causes impairment of critical system functions and no workaround solution exists. Some aspects of the product do not work (and the defect adversely affects successful completion of the mission in mission/safety software), but some attributes do work in its current situation.

Medium: Causes impairment of critical system function, though a workaround solution does exist. The product can be used, but a workaround (from a customer's preferred method of operation) must be used to achieve some capabilities. The presence of medium priority defects usually degrades the work.

Low: Causes a low level of inconvenience or annoyance. The product meets its requirements and can be used with just a little inconvenience. Typos in displays such as spelling, punctuation, and grammatical errors that do not generally cause operational problems are usually categorized as low severity.

None: Concerns a duplicate or completely trivial problem, such as a minor typo in supporting documentation.

Critical and High severity defects result in an approved change request or failure report. COQUALMO accounts for only the Critical, High and Medium severity defects. The overall data-reporting scheme for the model is shown in table 3.2.

Table 3.2 gives the overall picture of the data that needs to be collected to model COQUALMO. Since only critical, high and medium severity defects are accounted for, the table is kept simple by aggregating these defects. To understand the table, let's consider the first cell, marked 50/30/.2: 50 Requirements Defects were introduced in the Requirements activity, of which 30 were resolved in the Requirements activity, and the average cost to resolve each Requirements defect is .2 units (for example, 0.2 person hours).

Now, let's consider the second cell, marked 20+20/20/.5: 20 new Requirements defects were discovered in the Design activity and 20 (i.e. 50 - 30) were carried over from the Requirements activity, of which 20 were resolved in the Design activity, and the average cost to resolve each Requirements defect is .5 units (for example, 0.5 person hours).

Table 3.2: Data Reporting Scheme
Each cell gives Discovered + Unresolved / Resolved in Activity / Cost to Resolve, by activity (Reqts; Design; Code and Unit Test; SW Integ. and Test; SW System and Acceptance Test; Post-Implementation; Other and Operational) and by type of artifact (Requirements, Design, Code). Examples: Requirements defects in the Reqts activity: 50/30/.2; Requirements defects in the Design activity: 20+20/20/.5; Design defects in the Design activity: 55/25/1.0.

Note that the cost to resolve a Requirements defect increases as the defect propagates through the several activities of the development process. This is consistent with published results from [Boehm, 1981]. The complete description of COQUALMO is presented in chapter 4, section 4.3.
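The bookkeeping behind each cell of table 3.2 can be written out directly. The sketch below tracks Requirements defects only, using the two cells worked through above; the data structure and loop are illustrative, not part of COQUALMO itself.

```python
# Requirements-defect flow across activities, following the two cells
# explained above (counts and per-defect costs are those quoted in the text).
cells = [
    # (activity, newly discovered, resolved, cost per defect resolved)
    ("Requirements", 50, 30, 0.2),
    ("Design",       20, 20, 0.5),
]

carried_over = 0
for activity, discovered, resolved, unit_cost in cells:
    open_defects = carried_over + discovered
    carried_over = open_defects - resolved
    print(f"{activity}: {open_defects} open, {resolved} resolved "
          f"(resolution cost {resolved * unit_cost:.1f} units), "
          f"{carried_over} carried forward")
```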

CHAPTER 4: Research Contributions

4.1 Introduction

Chapter 4, the most important chapter of this dissertation, provides details of the Bayesian calibration of COCOMO II in section 4.2 and the formulation of the quality model extension to COCOMO II, namely COQUALMO, in section 4.3.

4.2 COCOMO II Calibration

Section 4.2 focuses on the calibration of COCOMO II and is divided into five parts. Section 4.2.1 describes the 1997 multiple regression calibration approach. Section 4.2.2 describes the 1999 Bayesian calibration approach and compares its results with those obtained by the multiple regression approach. Section 4.2.3 provides a brief description of a variation of the prior information used in the Bayesian approach, i.e. the well-defined G-prior approach. Section 4.2.4 describes a reduced COCOMO II model, i.e. a more parsimonious model with fewer than twenty-two parameters. Section 4.2.5 summarizes section 4.2.

4.2.1 The Multiple Regression Calibration Approach

As discussed in chapter 2, sections 2.2 and 2.6, most of the existing empirical software engineering cost models are calibrated using the classical multiple regression approach. In this section, the focus is on the overall description of the multiple regression approach and how it can be used on software engineering data. The restrictions imposed by the multiple regression approach, and the resulting problems faced by the software engineering community in trying to calibrate empirical models using this approach, are highlighted.

The 1997 calibration of the COCOMO II Post Architecture model (see section 2.2.9 and [Boehm, 1995] for an explanation of Post Architecture), using the multiple regression approach on the 1997 dataset composed of data from 83 completed projects collected from Commercial, Aerospace, Government and non-profit organizations (see Appendix B for the data collection form used), is described.

Multiple regression expresses the response (e.g. Person Months) as a linear function of k predictors (e.g. Source Lines of Code, Product Complexity, etc.). This linear function is estimated from the data using the ordinary least squares approach discussed in numerous books such as [Judge, 1993, Weisberg, 1985]. A multiple regression model is shown:

y_t = β_0 + β_1 x_t1 + ... + β_k x_tk + ε_t    Eq. 4.1

where x_t1 ... x_tk are the values of the predictor (or regressor) variables for the t-th observation, β_0 ... β_k are the coefficients to be estimated, ε_t is the usual error term, and y_t is the response variable for the t-th observation.

The COCOMO II Post Architecture model has the following mathematical form:

Effort = A × Size^(B + Σ_{i=1..5} SF_i) × Π_{i=1..17} EM_i    Eq. 4.2

where:
A = baseline multiplicative calibration constant
B = baseline exponential calibration constant, set at 1.01 for the 1997 calibration

Size = size of the software project, measured in terms of KSLOC (thousands of Source Lines of Code) [Park, 1992] or Function Points [IFPUG, 1994] and programming language
SF = scale factor
EM = effort multiplier

The five scale factors and 17 effort multipliers of the COCOMO II Post Architecture model are shown in table 4.1. All these twenty-two variables are measured qualitatively by selecting a rating from a well-defined rating scale. The rating options are: Very Low (VL), Low (L), Nominal (N), High (H), Very High (VH) and Extra High (XH). Intermediate (quarter-way, half-way, three-fourths-way) ratings between any two predefined ratings can also be selected.

Table 4.1: COCOMO II Post Architecture Parameters

SF1  PREC  Precedentedness: This captures the organization's understanding of product objectives. If the understanding and experience is very low then a Low rating is assigned, and if it is high then a High rating is assigned.
SF2  FLEX  Development Flexibility: This expresses the degree of conformance to software requirements and external interface standards. Full compliance is a Low rating and low compliance is a High rating.
SF3  RESL  Architecture and Risk Resolution: This rates the understanding of the product software architecture and the number/criticality of risk items. Full resolution is a High rating and little resolution is a Low rating.
SF4  TEAM  Team Cohesion: This captures the consistency of stakeholder objectives and the willingness of all parties to work together as a team. Difficult interactions get a Low rating and cooperative interactions receive a High rating.
SF5  PMAT  Process Maturity: This is the maturity of the software process used to produce the product. The criteria are directly related to the Capability Maturity Model, which has five levels, one (lowest) to five (highest). A Low PMAT rating is for a CMM level-one organization. A High PMAT rating is for a CMM level-five organization.

EM1  RELY  Required Software Reliability: This is the measure of the extent to which the software must perform its intended function over a period of time. If the effect of a software failure is only slight inconvenience, then RELY is low. If a failure would risk human life, then RELY is very high.
EM2  DATA  Data Base Size: This measure attempts to capture the effect large data requirements have on product development. The size of the database is important to consider because of the effort required to generate the test data that will be used to exercise the program.
EM3  RUSE  Required Reusability: This cost driver accounts for the additional effort needed to construct components intended for reuse on the current or future projects. This effort is consumed with creating a more generic design of the software, more elaborate documentation, and more extensive testing to ensure components are ready for use in other applications.
EM4  DOCU  Documentation Match to Life-cycle Needs: The rating scale for the DOCU cost driver is evaluated in terms of the suitability of the project's documentation to its life-cycle needs. The rating scale goes from Very Low (many life-cycle needs uncovered) to Very High (very excessive for life-cycle needs).
EM5  CPLX  Product Complexity: Complexity is divided into five areas: control operations, computational operations, device-dependent operations, data management operations, and user interface management operations. Select the area or combination of areas that characterize the product or a sub-system of the product. The complexity rating is the subjective weighted average of these areas.
EM6  TIME  Execution Time Constraint: This is a measure of the execution time constraint imposed upon a software system. The rating ranges from Nominal (less than 50% of the execution time resource used) to Extra High (95% of the execution time resource consumed).
EM7  STOR  Main Storage Constraint: This rating represents the degree of main storage constraint imposed on a software system or subsystem. The rating ranges from Nominal (less than 50%) to Extra High (95%).
EM8  PVOL  Platform Volatility: Platform is used here to mean the complex of hardware and software (OS, DBMS, etc.) the software product calls on to perform its tasks. The platform includes any compilers or assemblers supporting the development of the software system. This rating ranges from Low, where there is a major change every 12 months, to Very High, where there is a major change every two weeks.
EM9  ACAP  Analyst Capability: Analysts are personnel that work on requirements, high-level design and detailed design. The major attributes that should be considered in this rating are analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate. The rating should not consider the level of experience of the analyst; that is rated with AEXP, PEXP and LTEX. Analysts that fall in the 15th percentile are rated very low and those that fall in the 95th percentile are rated very high.

EM10  PCAP  Programmer Capability: Evaluation should be based on the capability of the programmers as a team rather than as individuals. Major factors which should be considered in the rating are ability, efficiency and thoroughness, and the ability to communicate and cooperate. The experience of the programmer should not be considered here; it is rated with AEXP, PEXP and LTEX. A very low rated programmer team is in the 15th percentile and a very high rated programmer team is in the 95th percentile.
EM11  PCON  Personnel Continuity: The rating scale for PCON is in terms of the project's annual personnel turnover: from 3%, very high, to 48%, very low.
EM12  AEXP  Applications Experience: This rating is dependent on the level of applications experience of the project team developing the software system or subsystem. The ratings are defined in terms of the project team's equivalent level of experience with this type of application. A very low rating is for application experience of less than 2 months. A very high rating is for experience of 6 years or more.
EM13  PEXP  Platform Experience: The Post-Architecture model broadens the productivity influence of PEXP, recognizing the importance of understanding the use of more powerful platforms, including more graphic user interface, database, networking, and distributed middleware.
EM14  LTEX  Language and Tool Experience: This is a measure of the level of programming language and software tool experience of the project team. A low rating is given for experience of less than 2 months. A very high rating is given for experience of 6 or more years.
EM15  TOOL  Use of Software Tools: The tool rating ranges from simple edit and code, very low, to integrated lifecycle management tools, very high.
EM16  SITE  Multi-Site Development: Determining this rating involves the assessment and averaging of two factors: site collocation (from fully collocated to international distribution) and communication support (from surface mail and some phone access to full interactive multimedia).
EM17  SCED  Required Development Schedule: This rating measures the schedule constraint imposed on the project team developing the software. The ratings are defined in terms of the percentage of schedule stretch-out or acceleration with respect to a nominal schedule for a project requiring a given amount of effort. Accelerated schedules tend to produce more effort in the later phases of development because more issues are left to be determined due to lack of time to resolve them earlier. A schedule compression of 75% is rated very low. A stretch-out of a schedule produces more effort in the earlier phases of development, where there is more time for thorough planning, specification and validation. A stretch-out of 160% is rated very high.

The cost drivers had a-priori values assigned to each of the ratings for the 1997 calibration that were consistent with the results of several published studies and were based on the expert-judgment of the researchers of the COCOMO II team.

Not all six rating levels were valid for all cost drivers. The values are shown in table 4.2.

Table 4.2: COCOMO II.1997 A-Priori Values
(a-priori rating-scale values, VL through XH, for the five scale factors and the 17 effort multipliers; A = 2.5, B = 1.01)
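Once each driver rating has been mapped to a numeric value, evaluating the Post Architecture effort equation (Eq. 4.2) is a one-line computation. In the sketch below only A = 2.5 and B = 1.01 are taken from the text; the scale-factor and effort-multiplier values are placeholders, not the table 4.2 a-priori values.

```python
import math

def cocomo_ii_effort(size_ksloc, scale_factors, effort_multipliers,
                     A=2.5, B=1.01):
    """Effort (PM) = A * Size**(B + sum(SF_i)) * product(EM_i)  -- Eq. 4.2."""
    exponent = B + sum(scale_factors)
    em_product = math.prod(effort_multipliers)
    return A * size_ksloc ** exponent * em_product

# Placeholder numeric values for the five scale factors and 17 effort
# multipliers (an EM of 1.0 corresponds to a Nominal rating).
sf = [0.01, 0.02, 0.01, 0.02, 0.01]
em = [1.0] * 15 + [1.10, 0.91]

print(round(cocomo_ii_effort(100.0, sf, em), 1))   # effort for a 100-KSLOC project
```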

We can linearize Eq. 4.2 by taking logarithms on both sides of the equation, as shown:

ln(PM) = β_0 ln(A) + β_1 B ln(Size) + β_2 SF_1 ln(Size) + ... + β_6 SF_5 ln(Size)
         + β_7 ln(EM_1) + β_8 ln(EM_2) + ... + β_22 ln(EM_16) + β_23 ln(EM_17)    Eq. 4.3

Using the linear model equation and the 1997 COCOMO II dataset consisting of 83 completed projects, the multiple regression approach [Chulani, 1997B, Chulani, 1998] was employed. Because some of the predictor variables had high correlations, they were aggregated into new parameters: Analyst Capability and Programmer Capability were aggregated into Personnel Capability, PERS, and Time Constraints and Storage Constraints were aggregated into Resource Constraints, RCON. The next highest correlation was between Precedentedness, PREC, and Development Flexibility, FLEX; since its value fell below the threshold of 0.65 used to flag high correlation, it was decided not to combine these two variables. Table 4.3 shows the highly correlated parameters that were aggregated for the 1997 calibration of COCOMO II.

Table 4.3: COCOMO II.1997 Highly Correlated Parameters
TIME (Timing Constraints) and STOR (Storage Constraints)   ->  RCON (Resource Constraints)
ACAP (Analyst Capability) and PCAP (Programmer Capability) ->  PERS (Personnel Capability)
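The linearized form in Eq. 4.3 maps directly onto a regression design matrix: a column of ones for the constant, a ln(Size) column, one SF_i x ln(Size) column per scale factor, and one ln(EM_i) column per effort multiplier. The sketch below builds such a matrix for a toy dataset; all numeric values are illustrative.

```python
import numpy as np

def design_matrix(size_ksloc, sf, em):
    """Columns of the linearized COCOMO II model (Eq. 4.3).

    size_ksloc: (n,) project sizes in KSLOC
    sf:         (n, 5)  scale-factor values per project
    em:         (n, 17) effort-multiplier values per project
    """
    log_size = np.log(size_ksloc)[:, None]
    return np.hstack([
        np.ones((len(size_ksloc), 1)),   # intercept term
        log_size,                        # ln(Size) term
        sf * log_size,                   # SF_i * ln(Size) terms
        np.log(em),                      # ln(EM_i) terms
    ])

# Toy data: 3 projects, illustrative values only.
rng = np.random.default_rng(0)
X = design_matrix(np.array([10.0, 50.0, 200.0]),
                  rng.uniform(0.0, 0.05, size=(3, 5)),
                  rng.uniform(0.8, 1.3, size=(3, 17)))
print(X.shape)   # (3, 24) -> 1 + 1 + 5 + 17 columns
```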

The regression estimated the β coefficients associated with the scale factors and effort multipliers, as shown in the RCode (statistical software developed at the University of Minnesota, [Cook, 1994]) run:

Data set = COCOMOII.1997; Response = log[PM], with the coefficient of log[SIZE] held at 1.01
Coefficient estimates (estimate, standard error and t-value) were produced for Constant_A and for PREC*log[SIZE], FLEX*log[SIZE], RESL*log[SIZE], TEAM*log[SIZE], PMAT*log[SIZE], log[RELY], log[DATA], log[RUSE], log[DOCU], log[CPLX], log[RCON], log[PVOL], log[PERS], log[PCON], log[AEXP], log[PEXP], log[LTEX], log[TOOL], log[SITE] and log[SCED].

As the results indicate, some of the regression estimates had counter-intuitive values, i.e. negative coefficients. Note also that the coefficient associated with log[SIZE] was forced to stay at 1.01 in the 1997 calibration due to the belief that software exhibits diseconomies of scale. This restriction was removed for the 1999 calibration due to concerns the users of COCOMO II.1997 expressed.

As an example, consider the Develop for Reuse (RUSE) effort multiplier. This multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects.

As shown in table 4.4a, if the RUSE rating is Extra High (XH), i.e. developing for reuse across multiple product lines, it will cause an increase in effort. On the other hand, if the RUSE rating is Low (L), i.e. developing with no consideration of future reuse, it will cause effort to decrease. This rationale is consistent with the results of twelve published studies of the relative cost of developing for reuse compiled in [Poulin, 1997] and was based on the expert-judgment of the researchers of the COCOMO II team. But the regression results produced a negative estimate for the β coefficient associated with RUSE. This negative coefficient results in the counter-intuitive rating scale shown in table 4.4b, i.e. an XH rating for RUSE causes a decrease in effort and an L rating causes an increase in effort. Note the opposite trends followed in tables 4.4a and 4.4b.

Table 4.4a: Develop for Reuse (RUSE) expert-determined a-priori rating scale
Table 4.4b: Develop for Reuse (RUSE) data-determined rating scale
(rating definitions for both tables: L = none, N = across project, H = across program, VH = across product line, XH = across multiple product lines; the 1997 a-priori values increase from L to XH, while the data-determined values decrease)

A possible explanation (discussed in a study by [Mullet, 1976] on "Why regression coefficients have the wrong sign") for this contradiction may be the lack of dispersion in the responses associated with RUSE.

A possible reason for this lack of dispersion is that RUSE is a relatively new cost factor and our follow-up indicated that the respondents did not have enough information to report its rating accurately during the data collection process. Additionally, many of the responses "I don't know" and "It does not apply" had to be coded as 1.0 (since this is the only way to code no impact on effort). Note (see figure 4.1) that with slightly more than 50 of the 83 datapoints for RUSE being set at Nominal and with no observations at XH, the data for RUSE does not exhibit enough dispersion along the entire range of possible values for RUSE. While this is the familiar errors-in-variables problem, the COCOMO II data doesn't allow the resolution of this difficulty. Thus, the assumption was made that the random variation in the responses for RUSE is small compared to the range of RUSE. The reader should note that other cost models that use the multiple regression approach rarely state this assumption explicitly, even though it is implicitly made.

Figure 4.1: Distribution of RUSE

Other reasons for the counter-intuitive results include the violation of some of the assumptions imposed by multiple regression [Briand, 1992]:

(i) The number of datapoints should be large relative to the number of model parameters (i.e. there are many degrees of freedom). Unfortunately, collecting data has been, and continues to be, one of the biggest challenges in the software estimation field. This is caused primarily by immature processes and management reluctance to release cost-related data.

(ii) There should be no extreme cases (i.e. outliers). Extreme cases can distort parameter estimates, and such cases frequently occur in software engineering data due to the lack of precision in the data collection process.

(iii) The predictor variables (cost drivers and scale factors) should not be highly correlated. Unfortunately, because cost data is historically rather than experimentally collected, correlations among the predictor variables are unavoidable.

The above restrictions are violated to some extent by the COCOMO II dataset. The COCOMO II calibration approach determines the coefficients for the five scale factors and the 17 effort multipliers (merged into fifteen due to high correlation, as discussed above). Considering the rule of thumb that every parameter being calibrated should have at least five datapoints, the COCOMO II dataset would need data on at least 110 (or 100, if we consider that parameters are merged) completed projects. Note that the COCOMO II.1997 dataset had just 83 datapoints.

The second point above indicates that due to imprecision in the data collection process, outliers can occur, causing problems in the calibration. For example, if a particular organization had extraordinary documentation requirements imposed by its management, then even a very small project would require a lot of effort to be expended in trying to meet the excessive documentation match to the life-cycle needs.

If the data collected simply used the highest DOCU rating provided in the model, then the huge amount of effort due to the stringent documentation needs would be underrepresented and the project would have the potential of being an outlier. Outliers in software engineering data, as indicated above, are mostly due to imprecision in the data collection process. The third restriction imposed requires that no parameters be highly correlated. As described above, in the COCOMO II.1997 calibration a few parameters were aggregated to alleviate this problem.

To resolve some of the counter-intuitive results produced by the regression analysis (e.g. the negative coefficient for RUSE, as explained above), a weighted average of the expert-judgement results and the regression results was used, with only 10% of the weight going to the regression results for all the parameters. The 10% weighting factor was selected because models with 40% and 25% weighting factors produced less accurate predictions. This pragmatic calibrating procedure moved the model parameters in the direction suggested by the sample data but retained the rationale contained within the a-priori values. An example of the 10% application using the RUSE effort multiplier is given in figure 4.2. As shown in the graph, the trends followed by the a-priori and the data-determined curves are opposite. The data-determined curve has a negative slope and, as shown above in table 4.4b, violates expert opinion. If we believe the data, then in practice the full data-determined model will lose credibility with users.
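The 10% weighting itself is a one-line computation per rating level. The sketch below uses hypothetical expert and data-determined values for a RUSE-like driver (the actual 1997 values are not reproduced here), chosen only to show the opposite trends being blended.

```python
# Blend expert a-priori and data-determined rating values, 90% / 10%.
# The values below are hypothetical, chosen only to show opposite trends.
expert = {"L": 0.90, "N": 1.00, "H": 1.10, "VH": 1.20, "XH": 1.30}
data   = {"L": 1.10, "N": 1.00, "H": 0.95, "VH": 0.90, "XH": 0.85}

weighted = {r: 0.9 * expert[r] + 0.1 * data[r] for r in expert}
print(weighted)   # retains the expert trend, nudged slightly toward the data
```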

Figure 4.2: Example of the 10% weighted average approach: RUSE. (The figure plots three curves across the L, N, H, VH, XH rating scale: the a-priori expert-opinion curve, the counter-intuitive data-determined curve, and the resulting 10% weighted-average curve.)

Using 10% of the data-driven values and 90% of the a-priori values, the COCOMO II.1997 calibrated values were determined as shown in table 4.5. The baseline calibration constant, A, evaluates to 2.45. The resulting calibration of the COCOMO II model using the 1997 dataset of 83 projects produced effort estimates within 30% of the actuals 52% of the time. The prediction accuracy improved to 64% when the data was stratified into sets based on the eighteen unique sources of the data [see Kemerer, 1987, Kitchenham, 1984, Jeffery, 1990 for further confirmation of local calibration improving accuracy]. The baseline calibration constant, A, of the COCOMO II equation was recalibrated for each of these sets, i.e. a different intercept was computed for each set. The constant value ranged from 1.23 to 3.72 for the eighteen sets and yielded the prediction accuracies shown in table 4.6.

Table 4.5: COCOMO II.1997 Values. (Rating-scale values, VL through XH, for the five scale factors (PREC, FLEX, RESL, TEAM, PMAT) and the seventeen effort multipliers (RELY, DATA, RUSE, DOCU, CPLX, TIME, STOR, PVOL, ACAP, PCAP, PCON, AEXP, PEXP, LTEX, TOOL, SITE, SCED); the individual numeric entries are not reproduced here. A = 2.45, B = 1.01.)

Table 4.6: Prediction Accuracy of COCOMO II.1997
              Before Stratification by Organization    After Stratification by Organization
PRED(.20)     46%                                      49%
PRED(.25)     49%                                      55%
PRED(.30)     52%                                      64%
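The accuracy measure used throughout this chapter, PRED(L), and the per-organization recalibration of the constant A can both be computed with a few lines of code. The sketch below is illustrative only; the data layout and the treatment of the exponent are assumptions rather than the actual COCOMO II calibration scripts.

```python
import math

def pred(actuals, estimates, level=0.30):
    """PRED(L): fraction of projects whose estimate falls within +/-L of the actual."""
    hits = sum(abs(est - act) / act <= level for act, est in zip(actuals, estimates))
    return hits / len(actuals)

def recalibrate_constant(projects):
    """Refit the multiplicative constant A for one stratum (e.g. one organization),
    holding the scale-factor exponent and effort multipliers at their calibrated values.
    Each project is a tuple (actual_PM, size_ksloc, exponent_E, product_of_EMs)."""
    # In log space, log(PM) = log(A) + E*log(Size) + log(EM product); average the residual.
    log_residuals = [math.log(pm) - e * math.log(size) - math.log(ems)
                     for pm, size, e, ems in projects]
    return math.exp(sum(log_residuals) / len(log_residuals))
```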

While the 10% weighted-average procedure produced a workable initial model, it was desirable to have a more formal methodology for combining expert judgement and sample information. Since not all of the parameters behaved like RUSE, it was inappropriate to assign the 10% uniform weight to all the parameters. For example, the coefficient for PERS was 0.99, and the data collected on this factor is not as noisy as that collected for RUSE; the factor PERS is well understood and is not as new a concept as RUSE. Hence, more weight should be given to the data-determined value; i.e., depending on the variance of the collected data and the a-priori confidence for each parameter, a weighted average should be appropriately selected. A Bayesian analysis with an informative prior provides such a framework.

4.2.2 The Bayesian Calibration Approach

Basic Framework - Terminology and Theory (for ease of readability, this material is duplicated from an earlier section)

Bayesian analysis is a well-defined and rigorous process of inductive reasoning that has been used in many scientific disciplines [Gelman, 1995, Zellner, 1983, Box, 1973]. A distinctive feature of the Bayesian approach is that it permits the investigator to use both sample (data) and prior (expert-judgement) information in a logically consistent manner in making inferences. This is done by using Bayes' theorem to produce a post-data or posterior distribution for the model parameters. Using Bayes' theorem, prior (or initial) values are transformed to post-data views; this transformation can be viewed as a learning process. The posterior distribution is determined by the variances of the prior and sample information. If the variance of the prior information is smaller than the variance of the sampling information, then a higher weight is assigned to the prior information.

On the other hand, if the variance of the sample information is smaller than the variance of the prior information, then a higher weight is assigned to the sample information, causing the posterior estimate to be closer to the sample information. Using Bayes' theorem, the two sources of information can be combined as follows:

f(β | Y) = f(Y | β) f(β) / f(Y)    Eq. 4.4

where β is the vector of parameters in which we are interested and Y is the vector of sample observations from the joint density function f(β, Y). In equation 4.4, f(β | Y) is the posterior density function for β, summarizing all the information about β; f(Y | β) is the sample information and is algebraically equivalent to the likelihood function for β; and f(β) is the prior information, summarizing the expert-judgement information about β. Equation 4.4 can be rewritten as:

f(β | Y) ∝ l(β | Y) f(β)    Eq. 4.5

In words, equation 4.5 means: Posterior ∝ Sample × Prior.

In the Bayesian analysis context, the prior probabilities are the simple unconditional probabilities prior to observing the sample information, while the posterior probabilities are the conditional probabilities given the sample and prior information. The Bayesian approach thus makes use of prior information that is not part of the sample data, providing an optimal combination of the two sources of information.

As described in many books on Bayesian analysis [Leamer, 1978, Box, 1973], the posterior mean, b**, and variance, Var(b**), are defined as:

b** = [ (1/s²) X′X + H* ]⁻¹ [ (1/s²) X′X b + H* b* ]
Var(b**) = [ (1/s²) X′X + H* ]⁻¹    Eq. 4.6

where X is the matrix of predictor variables, s² is the residual variance for the sample data, b is the sample (least-squares) estimate, and H* and b* are the precision (inverse of variance) and mean of the prior information, respectively. From equation 4.6 it is clear that, in order to determine the Bayesian posterior mean and variance, the mean and precision of the prior information and of the sampling information need to be determined. The next two subsections describe the approach taken to determine the prior information and the sampling information, followed by a subsection on the derivation of the Bayesian a-posteriori model.

Prior Information

To determine the prior information for the coefficients (i.e. b* and H*) for COCOMO II, I conducted a Delphi exercise (see chapter 2, section 2.3.1) [Helmer, 1966, Boehm, 1981, Shepperd, 1997]. Eight experts from the field of software estimation were asked to independently provide their estimates of the numeric values associated with each COCOMO II cost driver. Roughly half of these participating experts had been lead cost experts for large software development organizations, and a few of them were originators of other proprietary cost models.

All of the participants had at least ten years of industrial software cost estimation experience. Based on the credibility of the participants, we felt very comfortable using the results of the Delphi rounds as the prior information for the purposes of calibrating COCOMO II. Supporting experience is provided by [Vicinanza, 1991], where a study showed that estimates made by experts were more accurate than model-determined estimates; however, [Johnson, 1988] highlights evidence of the inefficiencies of expert judgment in other domains.
Once the first round of the Delphi was completed, I summarized the results in terms of the means and the ranges of the responses. These summarized results were quite raw, with significant variances caused by misunderstanding of the parameter definitions. In an attempt to improve the accuracy of these results and to attain better consensus among the experts, the results were distributed back to the participants. A better explanation of the behavior of the scale factors was provided, since the overall opinion was that they were not well understood. Each of the participants got a second opportunity to independently refine his/her response based on the responses of the rest of the participants in round 1. The results of round 2 for the seventeen effort multipliers were representative of real-world phenomena, and the decision was made to use them as the a-priori information. But for the five scale factors I conducted a third round and made sure that the participants had a very good understanding of the exponential behavior of these parameters; the results of the third round were used as the a-priori information for the five scale factors. Please note that if the prior variance for any parameter were zero (i.e. if all the experts responded with the same value for a parameter), then the Bayesian approach would not incorporate the information from the sampling data and would completely believe only the expert opinion.

This is a restriction imposed by the Bayesian approach and may not be the correct way to model software engineering data, but, not surprisingly in the software field, the experts did not all agree on an exact value for any parameter: every parameter had some non-zero prior variance associated with it.
Table 4.7 provides the a-priori set of values for the RUSE parameter, i.e. the Develop for Reuse parameter. As discussed in section 4.2.1, this multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects. As shown in table 4.7, if the RUSE rating is Extra High (XH), i.e. developing for reuse across multiple product lines, it causes an increase in effort by a factor of 1.54; on the other hand, if the RUSE rating is Low (L), i.e. developing with no consideration of future reuse, it causes effort to decrease by a factor of 0.89. The resulting productivity range for RUSE is 1.73 (= 1.54/0.89), and the corresponding variance was computed from the second Delphi round. Comparing the results of table 4.7 with the expert-determined a-priori rating scale for the 1997 calibration illustrated in table 4.4a validates the strong consensus of the experts on a Productivity Range for RUSE of about 1.7.

Table 4.7: COCOMO II.1999 A-Priori Rating Scale for Develop for Reuse (RUSE)
Rating definitions: L = None; N = Across project; H = Across program; VH = Across product line; XH = Across multiple product lines.
A-priori values: L = 0.89 ... XH = 1.54 (the intermediate values and the Delphi variance are not reproduced here). Productivity Range (least productive rating / most productive rating) = 1.54/0.89 = 1.73.

The Delphi-determined COCOMO II.1999 a-priori values for all twenty-two parameters are shown in table 4.8.

Table 4.8: Delphi-Determined COCOMO II.1999 "A-Priori" Values. (Rating-scale values, VL through XH, for the five scale factors (PREC, FLEX, RESL, TEAM, PMAT) and the seventeen effort multipliers (RELY, DATA, RUSE, DOCU, CPLX, TIME, STOR, PVOL, ACAP, PCAP, PCON, AEXP, PEXP, LTEX, TOOL, SITE, SCED); the individual numeric entries are not reproduced here. A = 2.79, B = 0.88. A and B are not determined by the Delphi, but are calibrated using linear regression.)

Sample Information

The sampling information is the result of a data collection activity initiated in September 1994, soon after the initial development of the COCOMO II description, subsequently published in [Boehm, 1995]. Affiliates of the Center for Software Engineering at the University of Southern California provided most of the data (see Appendix B for the data collection form used). These organizations represent the Commercial, Aerospace, and FFRDC (Federally Funded Research and Development Center) sectors of software development. Data on completed software projects is recorded on a data collection form (Appendix B).
A question asked very frequently concerns the definition of software size, i.e., what defines a line of source code or a Function Point (FP)? Appendix B of the Model Definition Manual [CSE, 1997] defines a logical line of code using the framework described in [Park, 1992], and [IFPUG, 1994] gives details on the counting rules for FPs. In spite of these definitions, the data collected to date exhibits local variations caused by differing interpretations of the counting rules. Another parameter that has different definitions within different organizations is effort, i.e., what is a person month (PM)? In COCOMO II, a PM is defined as 152 person hours, but this varies from organization to organization. This information is usually derived from time cards maintained by employees; however, uncompensated overtime hours often may not be reported on time cards and hence do not get accounted for in the PM count. This leads to variations in the data reported, and considerable caution was exercised while collecting the data. Variations also occur in the understanding of the subjective rating scales of the scale factors and effort multipliers ([Cuelenaere, 1994] developed a system to alleviate this problem and help users apply cost driver definitions consistently for the PRICE-S model).

For example, a Very High rating for analyst capability in one organization could be equivalent to a Nominal rating in another organization. All these variations suggest that any organization using a parametric cost model should locally calibrate the model to produce better estimates (please refer to the local calibration results discussed with table 4.6).
The sampling information includes data on the response variable, effort in Person Months (PM), where 1 PM = 152 hours, and predictor variables such as the actual size of the software in KSLOC (thousands of Source Lines of Code, adjusted for breakage and reuse). The database has grown from 83 datapoints in 1997 to 161 datapoints in 1999. The distributions of effort and size for the 1999 database of 161 datapoints are shown in figure 4.3. It was important to check the distributions to ensure that the log transformations (taken to go from equation 4.2 to equation 4.3) are indeed valid and that they yield the normality essential for multiple regression.

Figure 4.3: Distribution of Effort and Size: 1999 dataset of 161 observations. (Histograms of effort in PM and size in KSLOC.)

As can be noted, both histograms are positively skewed, with the bulk of the projects in the database having effort less than 500 PM and size less than 150 KSLOC.

Since the multiple regression approach based on least-squares estimation assumes that the response variable is normally distributed, the positively skewed histogram for effort indicates the need for a transformation. The relationships between the response variable and the predictor variables also need to be linear. The histograms for size in figures 4.3 and 4.4 and the scatter plot in figure 4.5 show that a log transformation is appropriate for size. Furthermore, the log transformations on effort and size are consistent with equations 4.2 and 4.3.

Figure 4.4: Distribution of log-transformed Effort and Size: 1999 dataset of 161 observations.

The plot of log(Effort) versus log(Size) shown in figure 4.5 depicts a strong relationship between the two parameters, which is also indicated in the regression run shown below.

Figure 4.5: Correlation between log[Effort] and log[Size]. (Scatter plot of log(Effort) against log(Size).)
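As a concrete illustration of the transformation and fit described above, the sketch below log-transforms effort and size and fits the linearized model by ordinary least squares. It is a simplified, assumption-laden sketch: only the effort-multiplier terms are shown, and in the full COCOMO II form each scale factor would enter as SF_i * log(Size) rather than as log(SF_i).

```python
import numpy as np

def fit_loglinear(pm, size, effort_multipliers):
    """Fit log(PM) = b0 + b1*log(Size) + sum_k b_k*log(EM_k) by ordinary least squares.
    pm, size: 1-D arrays of actual effort and size; effort_multipliers: 2-D array,
    one column per driver. Scale-factor columns (SF_i * log(Size)) are omitted here."""
    y = np.log(pm)
    X = np.column_stack([np.ones_like(y), np.log(size), np.log(effort_multipliers)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta   # beta[0] = log(A); np.exp(beta[0]) recovers the multiplicative constant A
```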

The regression analysis, done in RCode (developed at the University of Minnesota, [Cook, 1994]) on the log-transformed COCOMO II parameters using the dataset of 161 datapoints, yields the following results:

Data set = COCOMOII.1999, Response = log[PM]
Coefficient Estimates (estimate, standard error and t-value) for: Constant_A, log[SIZE], PREC*log[SIZE], FLEX*log[SIZE], RESL*log[SIZE], TEAM*log[SIZE], PMAT*log[SIZE], log[RELY], log[DATA], log[RUSE], log[DOCU], log[CPLX], log[TIME], log[STOR], log[PVOL], log[ACAP], log[PCAP], log[PCON], log[AEXP], log[PEXP], log[LTEX], log[TOOL], log[SITE], log[SCED]. (The numeric estimates are not reproduced here.)

The above results provide the estimates for the β coefficients associated with each of the predictor variables (see equation 4.3). The t-value (the ratio between the estimate and the corresponding standard error, where the standard error is the square root of the variance) may be interpreted as the signal-to-noise ratio associated with the corresponding predictor variables.

Hence, the higher the t-value, the stronger the signal (i.e. statistical significance) being sent by the predictor variable. These coefficients can be used to adjust the a-priori Productivity Ranges (PRs) to determine the data-determined PRs for each of the 22 parameters. For example, the data-determined PR for RUSE is obtained by raising the a-priori PR of 1.73 (shown in table 4.4) to the power of the estimated coefficient for log[RUSE].
While the regression produced intuitively reasonable estimates for most of the predictor variables, the negative coefficient estimate for RUSE (discussed earlier) and the magnitudes of the coefficients for AEXP (Applications Experience), LTEX (Language and Tool Experience), FLEX (Development Flexibility), and TEAM (Team Cohesion) violate the prior opinion about the impact of these parameters on effort (i.e. PM). The quality of the data probably explains some of the conflicts between the prior information and the sample data. Note that, compared to the results depicted in section 4.2.1, these regression results (using 161 datapoints) produced better estimates: only RUSE has a negative coefficient associated with it, compared to PREC, RESL, LTEX, DOCU and RUSE in the regression results using only 83 datapoints. Thus, adding more datapoints (i.e. increasing the degrees of freedom) reduced the problem of counter-intuitive results.

Combining Prior and Sampling Information: Posterior Bayesian Update

As a means of resolving the above conflicts, the Bayesian paradigm was used as a framework for formally combining prior expert judgment with the COCOMO II sample data.

Equation 4.6 reveals that if the precision of the a-priori information (H*) is greater (i.e. the variance of the a-priori information is smaller) than the precision of the sampling information ((1/s²)X′X), the posterior values will be closer to the a-priori values. This situation can arise when the gathered data is noisy, as depicted in figure 4.6 for an example cost factor, Develop for Reuse. Figure 4.6 illustrates that the degree-of-belief in the prior information is higher than the degree-of-belief in the sample data. As a consequence, a stronger weight is assigned to the prior information, causing the posterior mean to be closer to the prior mean.

Figure 4.6: A-Posteriori Bayesian Update in the Presence of Noisy Data (Develop for Reuse, RUSE). (The figure plots the Productivity Range, defined as the highest rating divided by the lowest rating, for the a-priori experts' Delphi value, the noisy-data analysis value, and the a-posteriori Bayesian update, in the case where the precision of the prior information exceeds the precision of the sampling information.)

On the other hand (not illustrated), if the sampling information ((1/s²)X′X) is more precise than the prior information (H*), then the higher weight assigned to the sampling information causes the posterior mean to be closer to the mean of the sampling data. The resulting posterior precision will always be higher than both the a-priori precision and the sample data precision.
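A minimal numerical sketch of the posterior update in equation 4.6 is given below, assuming the prior mean b* and prior precision matrix H* have already been assembled from the Delphi means and variances. The matrices and data here are placeholders; this is not the actual calibration code.

```python
import numpy as np

def bayesian_update(X, y, b_prior, H_prior):
    """Posterior mean and covariance per equation 4.6:
    b** = [X'X/s^2 + H*]^(-1) [X'X b / s^2 + H* b*],  Var(b**) = [X'X/s^2 + H*]^(-1)."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)        # sample (least-squares) estimate b
    resid = y - X @ b_ols
    s2 = resid @ resid / (len(y) - X.shape[1])            # residual variance s^2
    sample_precision = X.T @ X / s2                       # precision of the sampling information
    post_cov = np.linalg.inv(sample_precision + H_prior)  # Var(b**)
    post_mean = post_cov @ (sample_precision @ b_ols + H_prior @ b_prior)
    return post_mean, post_cov
```

When the Delphi variances are small, H* dominates the sum and the posterior mean stays near b*; when the data are plentiful and clean, X′X/s² dominates and the posterior mean tracks the regression estimate.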

Note that if the prior variance of any parameter were zero, then that parameter would be completely determined by the prior information. Although this is a restriction imposed by the Bayesian approach, it is of little concern, as the situation of complete consensus very rarely arises in the software engineering domain.
The complete Bayesian analysis on COCOMO II yields the Productivity Ranges (the ratio between the least productive parameter rating, i.e. the highest rating, and the most productive parameter rating, i.e. the lowest rating) illustrated in figure 4.7. Figure 4.7 gives an overall perspective of the relative software Productivity Ranges (PRs) provided by the COCOMO II.1999 parameters. The PRs provide insight for identifying the high-payoff areas to focus on in a software productivity improvement activity. For example, Product Complexity is the highest-payoff parameter and Development Flexibility is the lowest-payoff parameter in the Bayesian calibration of the COCOMO II model.

Figure 4.7: Bayesian A-Posteriori Productivity Ranges. (Bar chart of the COCOMO II.1999 parameters ordered by PR, from lowest to highest: Development Flexibility (FLEX), Team Cohesion (TEAM), Develop for Reuse (RUSE), Precedentedness (PREC), Architecture and Risk Resolution (RESL), Platform Experience (PEXP), Data Base Size (DATA), Required Development Schedule (SCED), Language and Tools Experience (LTEX), Process Maturity (PMAT), Storage Constraint (STOR), Use of Software Tools (TOOL), Platform Volatility (PVOL), Applications Experience (AEXP), Multi-Site Development (SITE), Documentation Match to Life Cycle Needs (DOCU), Required Software Reliability (RELY), Personnel Continuity (PCON), Time Constraint (TIME), Programmer Capability (PCAP), Analyst Capability (ACAP), Product Complexity (CPLX). The variance associated with each parameter is annotated along its bar.)

Along each bar in figure 4.7, the variance associated with the parameter is indicated. Hence, even though the two parameters Multi-Site Development (SITE) and Documentation Match to Life Cycle Needs (DOCU) have the same PR, the PR of SITE (variance of 0.007) is predicted with more than five times the certainty of the PR of DOCU (variance of 0.037). The COCOMO II.1999 values are shown in table 4.9.

Table 4.9: COCOMO II.1999 Values. (Bayesian a-posteriori rating-scale values, VL through XH, for the five scale factors and the seventeen effort multipliers; A = 2.94. The remaining entries, including B, are not reproduced here.)

COCOMO II.1999 produces effort estimates within 30% of the actuals 75% of the time. If the model's multiplicative constant is calibrated to each of the eighteen major sources of project data, the resulting model (with the constant ranging from 1.5 to 4.1) produces estimates within 30% of the actuals 80% of the time. It is therefore recommended that organizations using the model calibrate it with their own data to increase model accuracy and produce locally optimal estimates for similar types of projects.

Table 4.10: Prediction Accuracies of COCOMO II.1997, A-Priori COCOMO II.1999 and Bayesian A-Posteriori COCOMO II.1999, Before and After Stratification by Organization
              COCOMO II.1997            A-Priori COCOMO II.1999,       Bayesian A-Posteriori
              (1997 dataset,            based on Delphi results        COCOMO II.1999
              83 datapoints)            (1999 dataset, 161 datapoints) (1999 dataset, 161 datapoints)
              Before      After         Before      After              Before      After
PRED(.20)     46%         49%           48%         54%                63%         70%
PRED(.25)     49%         55%           55%         63%                68%         76%
PRED(.30)     52%         64%           61%         65%                75%         80%

From table 4.10 it is clear that the prediction accuracy of COCOMO II.1999, calibrated using the Bayesian approach on 161 datapoints, is better than the prediction accuracy of COCOMO II.1997, calibrated using the 10% weighted-average approach on 83 datapoints, and better than that of the a-priori COCOMO II model, which is based on the expert opinion gathered via the Delphi exercise. But the improvement in accuracy of the 1999 model versus the 1997 model could be due to the increase in the number of datapoints (from 83 to 161). Hence, to verify that the Bayesian approach indeed gives better results, I tried the 10% weighted-average approach on the dataset of 161 datapoints.

This yielded the parameter values shown in table 4.11 and the prediction accuracies shown in table 4.12.

Table 4.11: 10% Weighted-Average Regression Values on the COCOMO II.1999 Dataset. (Rating-scale values, VL through XH, for the five scale factors and the seventeen effort multipliers; the individual numeric entries are not reproduced here. A = 2.51.)

Table 4.12: Prediction Accuracies Using the 10% Weighted-Average Multiple-Regression Approach and the Bayesian Approach on the 1999 Dataset of 161 Datapoints
              10% Weighted-Average Multiple-Regression Approach    Bayesian Approach
PRED(.20)     52%                                                  63%
PRED(.25)     61%                                                  68%
PRED(.30)     68%                                                  75%

From table 4.12 it is clear that the Bayesian approach yields higher accuracies than the 10% weighted-average multiple-regression approach on the COCOMO II.1999 dataset of 161 datapoints.
Another way to verify that the Bayesian approach is indeed more accurate than the multiple regression approach when used on a validation dataset is discussed in the next few paragraphs. In this approach, three models, A, B and C, are generated using the 1997 dataset of 83 datapoints:
Model A: a pure regression-based model calibrated using the 83 datapoints.
Model B: the published COCOMO II.1997 model, which uses the 10% weighted-average approach discussed earlier on the 83 datapoints.
Model C: a Bayesian model calibrated using the 83 datapoints (note that this is not the same as the Bayesian COCOMO II.1999 model, which is calibrated using 161 datapoints, although the approach used, i.e. the Bayesian approach described in section 4.2.2, is identical).

Each of these models is then used to determine prediction accuracy on the 1997 dataset of 83 datapoints (the same dataset used to calibrate the models) and on the 1999 dataset of 161 datapoints. These accuracies are shown in table 4.13, and a discussion based on these results follows the table.

Table 4.13: Prediction Accuracies Using the Pure-Regression, the 10% Weighted-Average Multiple-Regression and the Bayesian Based Models, Calibrated Using the 1997 Dataset of 83 Datapoints and Validated Against 83 and 161 Datapoints
              Pure-Regression Based      COCOMO II.1997, 10% Weighted-    Bayesian Approach Based
              Model (Model A)            Average Based Model (Model B)    Model (Model C)
              validated on 83 / 161      validated on 83 / 161            validated on 83 / 161
PRED(.20)     49% / 31%                  46% / 54%                        41% / 54%
PRED(.25)     63% / 39%                  49% / 59%                        53% / 62%
PRED(.30)     64% / 44%                  52% / 63%                        58% / 66%

From table 4.13 it is clear that model A yields the highest accuracies on the 83 datapoints. This is expected, since the same dataset was used to calibrate the model, and no model other than the fully data-determined model should give better accuracies on it. But, as discussed in section 4.2.1, believing the data completely to determine the estimation model using the 1997 dataset of 83 datapoints results in a model that produces counter-intuitive results. Furthermore, when model A is used on the newer 1999 dataset of 161 datapoints, the prediction accuracies are relatively poor: only 44% of the projects are estimated within 30% of the actuals.

Model B performs better on the 1999 dataset of 161 datapoints, producing estimates within 30% of the actuals 63% of the time. But the Bayesian model (model C) outperforms models A and B, giving the highest prediction accuracy on the validation dataset of 161 datapoints: it produces estimates within 30% of the actuals 66% of the time. Based on these results, we can conclude that the Bayesian-calibrated COCOMO II.1999 model will produce the highest accuracies in estimating newly gathered data. In fact, as shown in table 4.12, the Bayesian-calibrated COCOMO II.1999 performs better than the 10% weighted-average model when both are calibrated using 161 datapoints and validated on the same dataset of 161 datapoints.

4.2.3 The Generalized G-Prior Approach

This section presents a generalized g-prior approach to calibrating the COCOMO II model. It shows that if the weights assigned to sample estimates versus expert judgement are allowed to vary according to precision, a superior predictive model results. While there are numerous procedures for assessing prior knowledge about the model parameters (i.e. b* and H* in equation 4.6), Zellner's g-prior approach was chosen because it does not require the difficult task of specifying the prior covariances for the elements of β [Zellner, 1983]. The g-prior approach assumes that the prior covariances for β are equal to those provided by the sample data. In other words, the prior precision matrix in equation 4.6 is given by

H* = (g / σ²) X′X    Eq. 4.7

Thus, the mean of the posterior distribution for β is

b** = (b + g b*) / (1 + g)    Eq. 4.8

where b is the vector of the ordinary least-squares values and b* is the vector of the values anticipated by the experts. The magnitude of g corresponds to the precision (i.e. relative weight) given to the prior. For example, when g = 0.1, the prior values b* receive a weight of one-ninth while the sample values receive a weight of eight-ninths. This version of the g-prior assumes that the same relative weight is assigned to each parameter.
Because the information about the individual effort multipliers and scale factors varies, Zellner's g-prior approach is extended to assign differential weights to the parameters. Keeping in mind that the respondents do not have a uniform understanding of each of the effort multipliers and scale factors, it is only logical to assign different weights to some of the parameters. For example, for the data collected on RUSE, a number of respondents answered "Nominal" when they actually should have responded "I don't know". Other factors where some difficulties were encountered include AEXP, LTEX, FLEX, and TEAM. As a result, the prior estimates of these five factors are given greater weights than the weights assigned to the prior knowledge of the other factors, where more precise data is available. We can show that the mean of the posterior distribution of β is given by:

b** = [ g Z X′X Z + X′X ]⁻¹ [ g Z X′X Z b* + X′X b ]    Eq. 4.9

where Z is a diagonal matrix with elements z1, ..., zk, zi ≥ 0.

The zi's are the differential weights for the effort multipliers and scale factors. When Z is an identity matrix (i.e. zi = 1 for all i) we have the usual g-prior case.
Initially, zi was assigned the value 5 for AEXP (Applications Experience), LTEX (Language and Tool Experience), FLEX (Development Flexibility), TEAM (Team Cohesion) and RUSE (Develop for Reuse); these five parameters were chosen because of the counter-intuitive data-determined results discussed earlier. This caused the prior estimates of these parameters to receive a weight of one-third while the sample estimates received a weight of two-thirds. And zi = 1 was assigned for the rest of the parameters, causing the prior estimates of those parameters to receive a weight of one-ninth while the sample estimates received a weight of eight-ninths.
Several different combinations of g and the Z matrix were tried. Using prediction accuracy as the model selection criterion, the best values were: for the g-prior approach, g = 0.1; and for the generalized g-prior approach, g = 0.1 with zi = 5 for AEXP, LTEX, FLEX, TEAM, RUSE and zi = 1 for the other cost drivers.
The generalized g-prior and ordinary least-squares estimates are reported in table 4.14. The ordinary least-squares estimates for the five factors highlighted in the above discussion range from a negative value for RUSE (counter-intuitive) to values of insufficient magnitude for the other four factors. The generalized g-prior approach, on the other hand, yields estimates that support prior expert opinion.
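The generalized g-prior update of equation 4.9 is straightforward to compute once the design matrix and the diagonal weight matrix Z are in hand. The sketch below is illustrative only; g and the zi values follow the text, while the data and matrix layout are assumptions.

```python
import numpy as np

def generalized_g_prior(X, y, b_prior, z, g=0.10):
    """Posterior mean per equation 4.9:
    b** = [g Z X'X Z + X'X]^(-1) [g Z X'X Z b* + X'X b], with Z = diag(z)."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least-squares estimate b
    Z = np.diag(z)
    XtX = X.T @ X
    prior_part = g * Z @ XtX @ Z                    # differentially weighted prior precision
    return np.linalg.solve(prior_part + XtX, prior_part @ b_prior + XtX @ b_ols)

# z_i = 5 for the five noisier factors (AEXP, LTEX, FLEX, TEAM, RUSE), 1 elsewhere.
```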

Table 4.14: Multiple-Regression and Generalized G-Prior Estimates. (Coefficient estimates for Constant_A, log[SIZE], the five scale-factor terms (PREC, FLEX, RESL, TEAM, PMAT, each times log[SIZE]) and the seventeen log-transformed effort multipliers, under the multiple-regression and generalized g-prior calibrations; the numeric entries are not reproduced here.)

Figure 4.8 illustrates how the g-prior approach handles noisy data. Here the prior degree-of-belief receives a weight of one-third while the sample data receives a weight of two-thirds. As a consequence, the posterior mean moves closer to the prior mean than would otherwise have been the case. Remember that zi = 1 for most of the effort multipliers and scale factors, whereas zi = 5 for AEXP and the other four parameters identified above.

Figure 4.8: A-Posteriori Generalized g-Prior Update in the Presence of Noisy Data (AEXP, Applications Experience). (The figure plots the Productivity Range, defined as the highest rating divided by the lowest rating, for the a-priori experts' Delphi value, the noisy-data analysis value, and the a-posteriori generalized g-prior update.)

The prediction accuracies within 30% of the actuals using the 10% weighted-average multiple-regression approach, the Bayesian approach and the g-prior approaches on the 1999 dataset of 161 datapoints are reported in table 4.15.

Table 4.15: PRED(.30) Using the 10% Weighted-Average Multiple Regression, Bayesian and G-Prior Approaches on the 1999 Dataset of 161 Datapoints
              10% Weighted-Average    Bayesian    G-Prior Approach    Generalized G-Prior Approach
              Approach                Approach    (g = 0.10)          (g = 0.10; zi = 1 for all but the 5 factors where zi = 5)
PRED(.30)     68%                     75%         70%                 76%

The generalized g-prior approach yields the model with the highest prediction accuracy, i.e. within 30% of the actuals 76% of the time, although it gave some counter-intuitive estimates, such as a lower-than-expected Productivity Range for RUSE. The accuracy of the Bayesian-calibrated approach is comparable to that of the g-prior approach and, since it doesn't yield counter-intuitive coefficient estimates, it results in the best calibration for COCOMO II.1999 with all twenty-two predictor variables, in the sense that it is more likely to do well when a wider dispersion of RUSE ratings becomes available in future data.

4.2.4 The Reduced Model

When calibrating COCOMO II, the three main problems in the data are (i) lack of degrees of freedom, (ii) some highly correlated predictor variables and (iii) measurement error for a few predictor variables. These limitations led to some counter-intuitive results. The posterior Bayesian update discussed in section 4.2.2 alleviated these problems by incorporating expert-judgement-derived prior information into the calibration process. But such prior information may not always be available. So what must one do in the absence of good prior information? One way to address this problem is to reduce overfitting by developing a more parsimonious model. This alleviates the first two problems listed above. Unfortunately, the current COCOMO II data doesn't lend itself to alleviating the third problem of measurement error, as discussed in section 4.2.1.
Consider a reduced model developed using a backward elimination technique:

Data set = COCOMOII.1999, Response = log[PM]
Coefficient Estimates (estimate, standard error and t-value) for the retained predictors: log[SIZE], PREC·log[SIZE], RESL·log[SIZE], log[PCAP], log[RELY], log[CPLX], log[TIME], log[PEXP], log[DATA], log[DOCU], log[PVOL], log[TOOL], log[SITE], log[SCED]. (The numeric estimates are not reproduced here.)

The above results have no counter-intuitive estimates for the coefficients associated with the predictor variables. The high t-ratio associated with each of these variables indicates a significant impact by each of the predictor variables. The highest correlation among any two predictor variables is 0.5, between RELY and CPLX. Overall, the above results are statistically acceptable. This reduced COCOMO II model gives the accuracy results shown in table 4.16.

Table 4.16: Prediction Accuracies of the Reduced COCOMO II.1998
PRED(.20)    54%
PRED(.25)    64%
PRED(.30)    73%

These accuracy results are a little worse than the results obtained by the Bayesian a-posteriori COCOMO II.1999 model, but the model is more parsimonious. In practice, though, removing a predictor variable is equivalent to stipulating that variations in that variable have no effect on project effort. When the experts and the behavioral analyses tell us otherwise (e.g. that variations in Team Cohesion, Applications Experience, Analyst Capability, and Personnel Continuity do affect project effort), extremely strong evidence is needed to drop a variable.
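A minimal sketch of the backward-elimination loop described above: repeatedly refit the regression and drop the predictor with the smallest absolute t-value until every remaining predictor clears a significance threshold. The threshold, the data handling and the stopping rule are assumptions for illustration, not the exact procedure used for the reduced model.

```python
import numpy as np

def backward_eliminate(X, y, names, t_min=2.0):
    """Drop the least significant column of X (never the intercept, assumed in column 0)
    until every remaining predictor has |t| >= t_min. Returns the surviving names."""
    names = list(names)
    while X.shape[1] > 2:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (len(y) - X.shape[1])
        se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # coefficient standard errors
        t = np.abs(beta / se)
        candidate = 1 + int(np.argmin(t[1:]))                # skip the intercept column
        if t[candidate] >= t_min:
            break
        X = np.delete(X, candidate, axis=1)
        del names[candidate]
    return names
```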

4.2.5 Conclusions on COCOMO II Calibration

In this section, the inappropriateness of the multiple-regression approach for empirically calibrating software cost estimation models was demonstrated using COCOMO II, and a clear need for improving the data collection process and/or the model calibration process was ascertained. A Bayesian framework for the software cost modeling domain was developed. This framework has been used successfully in many other domains, and here it provided better solutions to the problems faced with the multiple regression approach in calibrating COCOMO II. COCOMO II.1999, calibrated using the Bayesian approach, yielded accuracies within 30% of the actuals 75% of the time.
To avoid the difficult task of specifying prior covariances, two well-defined approaches (the g-prior and the generalized g-prior) were also tried within the Bayesian framework on the 1999 COCOMO II database. In addition to giving better prediction accuracies than those determined by the multiple-regression approach, the Bayesian approach resolved several of the data-determined counter-intuitive results. The generalized g-prior approach gave the best prediction accuracies on the 1999 COCOMO II database, although it did result in some controversial estimates, such as a Productivity Range for RUSE of less than 1.2; unfortunately, given the desire to have a highly data-determined model, this cannot be resolved using the current dataset.
To address the problem of overfitting, a more parsimonious model was developed using the backward elimination technique. Although the reduced model provided reasonable accuracies compared to the Bayesian a-posteriori COCOMO II.1999, it was not the best calibrated version, since predictor variables believed to be significant were not included in the model. In practice, removing a predictor variable is equivalent to stipulating that variations in that variable have no effect on project effort.

Based on these results, the current most-likely-to-succeed model is COCOMO II.1999, which gives reasonably good accuracies and is an experientially sound model.

4.3 COnstructive QUALity MOdel (COQUALMO)

Section 4.3 describes the formulation of COQUALMO based on the defect introduction and defect removal model framework provided in section 3.4. It is divided into six parts, describing the Defect Introduction sub-model, the Defect Removal sub-model, a proposed rosetta stone that can be used to map the DI sub-model to the DR sub-model in order to simplify the data-collection process, an independent validation study, the integrated COCOMO II-COQUALMO framework, and some conclusions on the quality model.

4.3.1 The Software Defect Introduction (DI) Sub-Model

Defects can be introduced in several activities of the software development life cycle. As discussed in section 3.4, for the purposes of COQUALMO defects are classified based on their origin as Requirements Defects (e.g. leaving out a required Cancel option in an input screen), Design Defects (e.g. an error in the algorithm), and Coding Defects (e.g. looping 9 instead of 10 times).
The DI model's inputs include Source Lines of Code and/or Function Points as the sizing parameter, adjusted for both reuse and breakage, and a set of 21 multiplicative DI-drivers divided into four categories (platform, product, personnel and project), as summarized in table 4.17. These 21 DI-drivers are a subset of the 22 cost parameters required as input for COCOMO II.

The decision to use these drivers was made after an extensive literature search and behavioral analyses of the factors affecting defect introduction (this led to dropping the Development Flexibility (FLEX) factor, as it has little evident effect on DI). An example DI-driver and its behavioral analysis are shown in table 4.18 (the numbers in table 4.18 are discussed later in this section). The choice of using COCOMO II drivers not only makes it relatively straightforward to integrate COQUALMO with COCOMO II, but also simplifies the data collection activity, which has already been set up for COCOMO II.

Figure 4.9: The Defect Introduction Sub-Model of COQUALMO. (Inputs: the software size estimate and the software platform, product, personnel and project attributes. Output: the number of non-trivial requirements, design and coding defects introduced.)

The DI model's output is the predicted number of non-trivial requirements, design and coding defects introduced in the development life cycle, where non-trivial defects include:
Critical (causes a system crash or unrecoverable data loss, or jeopardizes personnel)

High (causes impairment of critical system functions and no workaround solution exists)
Medium (causes impairment of critical system functions, though a workaround solution does exist).
Note that trivial defects (listed below) are not accounted for in COQUALMO:
Low (causes minor inconvenience or annoyance)
None (none of the above, or concerns an enhancement rather than a defect).
This classification is adapted from an IEEE standard severity classification.

Table 4.17: Defect Introduction Drivers (Post-Architecture Model)
Product: Required Software Reliability (RELY), Data Base Size (DATA), Required Reusability (RUSE), Documentation Match to Life-Cycle Needs (DOCU), Product Complexity (CPLX)
Platform: Execution Time Constraint (TIME), Main Storage Constraint (STOR), Platform Volatility (PVOL)
Personnel: Analyst Capability (ACAP), Programmer Capability (PCAP), Personnel Continuity (PCON), Applications Experience (AEXP), Platform Experience (PEXP), Language and Tool Experience (LTEX)
Project: Use of Software Tools (TOOL), Multisite Development (SITE), Required Development Schedule (SCED), Precedentedness (PREC), Architecture/Risk Resolution (RESL), Team Cohesion (TEAM), Process Maturity (PMAT)

Table 4.18: Programmer Capability (PCAP) Differences in Defect Introduction
Very High: Requirements - N/A (1.0). Design - fewer design defects due to easy interaction with analysts, and fewer defects introduced in fixing defects (0.85). Code - fewer coding defects due to fewer detailed design reworks, conceptual misunderstandings and coding mistakes (0.76).
Nominal: 1.0 for requirements, design and code.
Very Low: Requirements - N/A (1.0). Design - more design defects due to less easy interaction with analysts, and more defects introduced in fixing defects (1.17). Code - more coding defects due to more detailed design reworks, conceptual misunderstandings and coding mistakes (1.32).
(The table also records the initial DI range, the range and median from Delphi round 1, the range from round 2, and the final DI range taken as the round-2 median; these intermediate values are not reproduced here.)
PCAP rating definitions: Very Low = 15th percentile; Low = 35th percentile; Nominal = 55th percentile; High = 75th percentile; Very High = 90th percentile.

The total number of defects introduced is

Total Defects Introduced = Σ (j = 1 to 3) A_j · (Size)^(B_j) · Π (i = 1 to 21) (DI-driver)_ij    Eq. 4.8

where:
j identifies the three artifact types (requirements, design and coding).
A is the baseline DI rate adjustment factor.
Size is the size of the software project, measured in KSLOC (thousands of Source Lines of Code [Park, 1992]) or Function Points [IFPUG, 1994] and programming language.
B is initially set to 1 and accounts for economies / diseconomies of scale. It is unclear whether Defect Introduction Rates will exhibit economies or diseconomies of scale, as indicated in [Banker, 1994] and [Gulledge, 1993]. The question is: if Size doubles, will the number of defects introduced more than double, indicating diseconomies of scale and implying B > 1, or increase by a factor of less than two, indicating economies of scale and giving B < 1 [as illustrated in chapter 3 of Jones, 1997]?
For each j-th artifact, a new parameter QAF_j is defined such that

QAF_j = Π (i = 1 to 21) (DI-driver)_ij    Eq. 4.9

where (DI-driver)_ij is the Defect Introduction driver for the j-th artifact and the i-th factor.

Hence, the total number of defects introduced is

Total Defects Introduced = Σ (j = 1 to 3) A_j · (Size)^(B_j) · QAF_j    Eq. 4.10

where for each of the three artifacts the model equations are:

Requirements Defects Introduced: DI_Est,req = A_req · (Size)^(B_req) · QAF_req
Design Defects Introduced: DI_Est,des = A_des · (Size)^(B_des) · QAF_des
Coding Defects Introduced: DI_Est,cod = A_cod · (Size)^(B_cod) · QAF_cod

For the empirical formulation of the Defect Introduction model, as with COCOMO II, it was essential to assign numerical values to each of the ratings of the DI-drivers. Based on expert judgment, an initial set of values was proposed for the model, as shown in table 4.18 for an example DI-driver, Programmer Capability (PCAP). If a DI-driver is greater than 1, it has a detrimental effect on the number of defects introduced and on overall software quality; if the DI-driver is less than 1, fewer defects are introduced, improving the quality of the software being developed. This is analogous to the effect the COCOMO II multiplicative cost drivers have on effort. So, for example, for a project with programmers having Very High capability ratings (programmers at the 90th percentile level), only 76% of the nominal number of defects will be introduced during the coding activity; whereas, if the project had programmers with Very Low capability ratings (programmers at the 15th percentile level), then 132% of the nominal number of coding defects will be introduced. This gives a Defect Introduction Range of 1.32/0.76 = 1.77 for coding defects for PCAP, where the Defect Introduction Range is defined as the ratio between the largest DI-driver value and the smallest DI-driver value.
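A hedged sketch of the DI sub-model equations above is given below. The function mirrors equations 4.8 through 4.10; the baseline rates used in the example echo the nominal 1990s Defect Introduction Rates quoted later in this section, and all driver values are illustrative placeholders rather than calibrated COQUALMO numbers.

```python
# Sketch of the Defect Introduction sub-model: for each artifact type j,
# defects introduced = A_j * Size^B_j * QAF_j, with QAF_j the product of the 21 DI-drivers.

def defects_introduced(size_ksloc, di_drivers, A, B=None):
    """di_drivers: {artifact: list of 21 driver values}; A: {artifact: baseline rate};
    B: optional {artifact: exponent}, taken as 1.0 if not supplied."""
    result = {}
    for artifact, drivers in di_drivers.items():
        qaf = 1.0
        for d in drivers:
            qaf *= d
        b_j = 1.0 if B is None else B[artifact]
        result[artifact] = A[artifact] * (size_ksloc ** b_j) * qaf
    return result

# Example: a 100-KSLOC project with all drivers Nominal (1.0) and illustrative
# baseline rates of 10/20/30 defects per KSLOC for requirements/design/code.
baseline = {"requirements": 10, "design": 20, "coding": 30}
nominal_drivers = {art: [1.0] * 21 for art in baseline}
print(defects_introduced(100, nominal_drivers, baseline))
```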

To get further group consensus, I conducted a two-round Delphi involving nine experts in the field of software quality. The nine participants selected for the Delphi process were representatives of Commercial, Aerospace, Government, FFRDC and Consortia organizations. Each of the participants had notable expertise in the areas of software metrics and quality management, and a few of them had developed their own proprietary cost/schedule/quality estimation models. The rates determined from the behavioral analyses in Appendix A and by Boehm (my Ph.D. advisor and committee chair, who has considerable experience in software economics) and Chulani were used as initial values, and each of the participants independently provided their own assessments (see Appendix A and [Chulani, 1997C]). The quantitative relationships and their potential range of variability for each of the 21 DI-drivers were summarized and sent back to each of the nine Delphi participants for a second assessment. The participants then had the opportunity to update their rates based on the summarized results of round 1. It was observed that the range in round 2 was typically narrower than the range in round 1, i.e. round 2 resulted in better agreement among the experts. The Delphi approach that was used is summarized below:
Round 1 - Steps
1. Provided participants with the round 1 Delphi questionnaire, with a proposed set of values for the Defect Introduction Ranges.
2. Received nine completed round 1 Delphi questionnaires.

3. Ensured validity of responses by correspondence with the participants.
4. Did simple analysis based on the ranges and medians of the responses.
Round 2 - Steps
1. Provided participants with the round 2 Delphi questionnaire, including response distributions and analysis.
2. Repeated steps 2, 3 and 4 above.
3. Converged to the final Delphi results, which defined the initial model.
Figure 4.10 provides a graphical view of the relative Defect Introduction Ranges for coding defects provided by all 21 defect drivers. For example, if all other parameters are held constant, a Very Low (VL) rating for Process Maturity (PMAT) will result in a software project with 2.5 times the number of coding defects introduced, compared to an Extra High (XH) rating. The figure also illustrates that the experts' opinion suggested that PMAT has the highest impact and RUSE the lowest impact on the introduction of coding defects. A detailed description of each of the 21 DI-drivers and their impact on defect introduction for each type of defect artifact is in Appendix A.
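The between-round summarization in the steps above (medians and ranges fed back to the participants) is simple to automate. A minimal sketch, assuming the responses are collected as one list of numeric estimates per DI-driver:

```python
import statistics

def summarize_round(responses):
    """responses: {driver: [one Defect Introduction Range estimate per expert]}.
    Returns the median and (min, max) range distributed back between Delphi rounds."""
    return {driver: {"median": statistics.median(vals),
                     "range": (min(vals), max(vals))}
            for driver, vals in responses.items()}
```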

Figure 4.10: Coding Defect Introduction Ranges. (Bar chart of the 21 DI-drivers ordered by their coding-defect introduction range, from lowest to highest: RUSE, STOR, TIME, DATA, ACAP, AEXP, PEXP, DOCU, SITE, TEAM, SCED, LTEX, PVOL, PREC, TOOL, PCON, PCAP, CPLX, RESL, RELY, PMAT.)

Some initial data analysis (on data gathered from a COCOMO II affiliate's 1990s project) was used to update the 1970s baseline DIRs of 5 requirements defects, 25 design defects and 15 coding defects. Table 4.19 illustrates the details.

Table 4.19: Initial Data Analysis on the DI Model. (For each artifact type - requirements, design, code - the table lists the 1970s baseline DIRs (5, 25 and 15 respectively), the Quality Adjustment Factor QAF_j, the predicted DIR, the actual DIR for the 1990s project, the baseline DIR adjustment factor A_j, and the resulting 1990s baseline DIRs; the intermediate numeric entries are not reproduced here.)

The updated 1990s nominal Defect Introduction Rates (i.e. the number of defects per KSLOC without the impact of the Quality Adjustment Factor) are approximately 10 requirements defects, 20 design defects and 30 coding defects, i.e. DIR_req;nom = 10, DIR_des;nom = 20, DIR_cod;nom = 30. For readers familiar with COCOMO, this is analogous to the nominal effort without the impact of the Effort Adjustment Factor. Note that for each artifact j the exponent B_j = 1 for this initial data analysis. When more data are available this factor will also be calibrated and may take values other than 1; but for now, due to the lack of enough datapoints and the lack of expert opinion on this factor, it has been set to 1.

4.3.2 The Software Defect Removal (DR) Sub-Model

The aim of the Defect Removal (DR) model is to estimate the number of defects removed by several defect removal activities, depicted as defect removal pipes in figure 3.8 of section 3.4. The DR model is a post-processor to the DI model and is formulated by classifying defect removal activities into three relatively orthogonal profiles, namely Automated Analysis, Peer Reviews, and Execution Testing and Tools (see figure 4.11). Each of these three defect removal profiles removes a fraction of the requirements, design and coding defects introduced in the DI pipes of figure 3.8, described as the DI model in section 4.3.1. Each profile has six levels of increasing defect removal capability, namely Very Low, Low, Nominal, High, Very High and Extra High, with Very Low being the least effective and Extra High being the most effective in defect removal. Table 4.20 describes the three profiles and the six levels of each.

Figure 4.11: The Defect Removal Sub-Model of COQUALMO. (Inputs: the number of non-trivial requirements, design and coding defects introduced, the defect removal profile levels, and the software size estimate. Output: the number of residual defects per KSLOC, or per some other unit of size.)

The Automated Analysis profile includes code analyzers, syntax and semantics analyzers, type checkers, requirements and design consistency and traceability checkers, model checkers, formal verification and validation, etc.
The Peer Reviews profile covers the spectrum of all peer group review activities. The Very Low level is when no peer reviews take place; the Extra High level is the other end of the spectrum, with extensive preparation, formal review roles assigned to the participants, and extensive user/customer involvement. At that level a formal change control process is incorporated with procedures for fixes, extensive review checklists are prepared with thorough root cause analysis, and continuous review process improvement is incorporated with statistical process control.

Table 4.20: The Defect Removal Profiles (each rating includes the activities of the lower ratings on the scale)

Very Low - Automated Analysis: simple compiler syntax checking. Peer Reviews: no peer review. Execution Testing and Tools: no testing.

Low - Automated Analysis: basic compiler capabilities for static module-level code analysis, syntax and type-checking. Peer Reviews: ad-hoc informal walkthroughs; minimal preparation, no follow-up. Execution Testing and Tools: ad-hoc testing and debugging; basic text-based debugger.

Nominal - Automated Analysis: some compiler extensions for static module and inter-module level code analysis, syntax and type-checking; basic requirements and design consistency and traceability checking. Peer Reviews: well-defined sequence of preparation, review and minimal follow-up; informal review roles and procedures. Execution Testing and Tools: basic unit test, integration test and system test process; basic test data management and problem tracking support; test criteria based on checklists.

High - Automated Analysis: intermediate-level module and inter-module code syntax and semantic analysis; simple requirements/design view consistency checking. Peer Reviews: formal review roles, with all participants well trained and procedures applied to all products using basic checklists and follow-up. Execution Testing and Tools: well-defined test sequence tailored to the organization (acceptance / alpha / beta / flight / etc. test); basic test coverage tools and test support system; basic test process management.

Very High - Automated Analysis: more elaborate requirements/design view consistency checking; basic distributed-processing and temporal analysis, model checking, symbolic execution. Peer Reviews: formal review roles, with all participants well trained and procedures applied to all product artifacts and changes (formal change control boards); basic review checklists and root cause analysis; formal follow-up; use of historical data on inspection rate, preparation rate and fault density. Execution Testing and Tools: more advanced test tools, test data preparation, basic test oracle support, distributed monitoring and analysis, assertion checking; metrics-based test process management.

Extra High - Automated Analysis: formalized specification and verification (consistency-checkable preconditions and postconditions, but not mathematical theorems); advanced distributed-processing and temporal analysis, model checking, symbolic execution. Peer Reviews: formal review roles and procedures for fixes and change control; extensive review checklists and root cause analysis; continuous review process improvement; user/customer involvement; statistical process control. Execution Testing and Tools: highly advanced tools for test oracles, distributed monitoring and analysis, assertion checking; integration of automated analysis and test tools; model-based test process management.

129 place. Not much software development is done this way. The nominal level involves the use of a basic testing process with unit testing, integration testing and system testing with test criteria based on simple checklists and with a simple problem tracking support system in place and basic test data management. The extra high level involves the use of highly advanced tools for test oracles with the integration of automated analysis and test tools and distributed monitoring and analysis. Sophisticated model-based test process management is also employed at this level. To determine the Defect Removal Fractions (DRF) associated with each of the six levels (i.e. very low, low, nominal, high, very high, extra high) of the three profiles (i.e. automated analysis, peer reviews, execution testing and tools) for each of the three types of defect artifacts (i.e. requirements defects, design defects and code defects), I conducted a 2-round Delphi. Unlike the Delphi conducted for the DI model where initial values were provided, I did not provide initial values for the DRFs of the DR model. This decision was made when the participants wanted to see how divergent the results would be if no initial values were provided. 6 Fortunately though (as shown in Appendix A and table 4.21 for automated analysis), the results didn t diverge a lot and the outcome of the 2-round Delphi was a robust expert-determined DR model. 6 The Delphi was done at a workshop focused on COQUALMO held in conjunction with the 13 th International Forum on COCOMO and Software Cost Modeling in October The ten workshop participants included practitioners in the field of software estimation and modeling and quality assurance and most of them had participated in the Delphi rounds of the DI model. I am very grateful to the participants who not only attended the workshop but also spent a significant amount of their time providing follow-up and useful feedback to resolve pending issues even after the workshop was over. 117

Table 4.21: Results of the 2-Round Delphi for Defect Removal Fractions for Automated Analysis. (For each rating level, VL through XH, the table gives the median and the range (min-max) of the DRFs for requirements, design and code defects, for round 1 and round 2; the numeric values are not reproduced here - see Appendix A.)

The results of the round 2 Delphi (see Appendix A) were used as the DRFs to formulate the initial version of the DR model, shown in equation 4.11. For each artifact j (requirements, design, code),

DRes_Est,j = C_j · DI_Est,j · Π_i (1 - DRF_ij)    Eq. 4.11

where DRes_Est,j is the estimated number of residual defects for the j-th artifact,

C_j is the baseline DR constant for the j-th artifact, DI_Est,j is the estimated number of defects of artifact type j introduced, i = 1 to 3 indexes the DR profiles (Automated Analysis, Peer Reviews, Execution Testing and Tools), and DRF_ij is the Defect Removal Fraction for defect removal profile i and artifact type j.
Using the nominal DIRs (see the last paragraph of section 4.3.1) and the DRFs from the second round of the Delphi (from Appendix A), the residual defect density was computed with each of the three profiles set at the Very Low, Low, Nominal, High, Very High and Extra High levels (table 4.22). For example, nominal ratings for the DI-drivers and Very Low ratings for each of the three DR profiles result in a residual defect density of 60 defects/KSLOC. Similarly, nominal ratings for the DI-drivers and Extra High ratings for each of the three DR profiles result in a residual defect density of 1.57 defects/KSLOC (see table 4.22). Thus, using the quality model described here, one can conclude that for a project with nominal characteristics (or average ratings) for the DI-drivers and the DR profiles, the residual defect density is approximately 14 defects/KSLOC.
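A hedged sketch of the residual-defect computation of equation 4.11 follows. The DRF values in the example are placeholders, not the Delphi-determined fractions of Appendix A, and the baseline constants C_j are taken as 1.0.

```python
def residual_defects(di_estimates, drfs, C=None):
    """Equation 4.11: DRes_j = C_j * DI_j * product over profiles of (1 - DRF_ij).
    di_estimates: {artifact: defects introduced}; drfs: {artifact: {profile: DRF}};
    C: optional {artifact: baseline DR constant}, taken as 1.0 when not supplied."""
    residual = {}
    for artifact, introduced in di_estimates.items():
        remaining_fraction = 1.0
        for drf in drfs[artifact].values():
            remaining_fraction *= (1.0 - drf)
        c_j = 1.0 if C is None else C[artifact]
        residual[artifact] = c_j * introduced * remaining_fraction
    return residual

# Placeholder DRFs for one rating level (illustrative only)
drfs = {"coding": {"automated analysis": 0.10,
                   "peer reviews": 0.30,
                   "execution testing and tools": 0.40}}
print(residual_defects({"coding": 30.0}, drfs))   # residual coding defects per KSLOC
```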

Table 4.22: Defect Density Results from Initial DRF Values. (With nominal DI-driver ratings and all three DR profiles - Automated Analysis, Peer Reviews, Execution Testing and Tools - set at the same level, the estimated total residual defect densities are: VL = 60, L = 28.5, N = 14.3, H = 7.5, VH = 3.5 and XH = 1.57 defects/KSLOC. The per-artifact breakdowns, the (1 - DRF_ij) products, and the DI/KSLOC column - the baseline DIRs excluding DI-driver effects - are not reproduced here.)

The next few paragraphs attempt to validate the above expert-determined COQUALMO defect densities against other published reports on defect densities. Figure 4.12 shows the results of a study done at Hewlett-Packard by Grady and Caswell in the late 1980s. It depicts the defect density, i.e. the number of defects per thousand non-comment source statements (KNCSS), where the number in brackets above each bar is the actual size of the system.

Figure 4.12: Reported Defect Densities [Grady, 1987]

It is clear from the above figure that defect density varies considerably across the published results. This can partly be attributed to differences in the definition of a counted defect. As Fenton and Pfleeger report in their book [Fenton, 1997], most of these published results concern anonymous third parties, making validation impossible [Bazzana, 1995; Daskalantonakis, 1992; Pfleeger, 1994; Wohlwend, 1994]. In spite of the difficulty in determining what was measured, and in turn the validity of the numbers, a few researchers have published average defect densities in the US. Dyer showed that between 0 and 60 defects per kSLOC are introduced and that, with approximately 85-90% defect removal efficiency, between 0 and 10 residual defects per kSLOC are observed in the USA and Europe [Dyer, 1992]. Jones did a study based on his clientele database of approximately 8500 projects to analyze the impact of software quality on the software labor shortage [Jones, 1998]. In his study, he classified the data as lagging, average, and leading software projects and found a significant difference in their residual defect densities. Leading software projects had a residual defect density of 1.4 defects/kSLOC, average projects 7.5 defects/kSLOC, and laggard projects 18.3 defects/kSLOC. Considering the SEI-published CMM distribution of 807 organizations [SEI, 1999], where 51.1% are at level 1, 28.9% at level 2, 15.7% at level 3, 3.7% at level 4 and 0.6% at level 5, and assuming that organizations at level 1 are lagging, those at level 2 are average, and those at levels 3, 4 and 5 are leading, we get an average defect density using Jones's data of approximately 12 defects/kSLOC (0.511 × 18.3 + 0.289 × 7.5 + 0.200 × 1.4 ≈ 11.8). This maps quite well onto the residual defect density estimated by COQUALMO, which says that a nominal project will have a residual defect density of approximately 14.3 defects/kSLOC.

4.3.3 Proposed DI Sub-Model to DR Sub-Model Rosetta-Stone

Since data collection is a non-trivial task, every attempt has been made to reduce the burden of data collection. As discussed in section 4.3.1, a subset of the COCOMO II Post-Architecture parameters is used as the DI-drivers. Since the data collection artifacts and process are already set up for COCOMO II, this simplifies the data collection process for the DI sub-model of COQUALMO. This sub-section provides an expert-determined rosetta-stone that maps two of the DI-drivers (Required Software Reliability, RELY, and Process Maturity, PMAT) to the six levels of each of the three DR profiles. As pointed out in [Reifer, 1999], the original Rosetta Stone is a black slab, found by French troops in 1799 in Egypt, that contained three scripts (Greek, demotic and hieroglyphics) and was used by archaeologists to translate a decree praising the Egyptian King Ptolemy V. Figure 4.13 illustrates the purpose of the rosetta-stone being proposed here.

Figure 4.13: DI Sub-Model to DR Sub-Model Rosetta-Stone
[Inputs: Process Maturity (PMAT) and Required Software Reliability (RELY); output: ratings for the Defect Removal Profiles, i.e. Automated Analysis, Peer Reviews, and Execution Testing and Tools.]

Table 4.23 contains the expert-determined rosetta-stone (a sketch of the corresponding lookup follows the table). For example, if RELY and PMAT are both Nominal, then all three DR profiles are set at Nominal, resulting in approximately 74% of the defects being removed. As another example, if RELY = Nominal but PMAT = Very High, then approximately 92% of the defects introduced are removed.

4.3.4 An Independent Validation Study

Our initial assumption was that quality data and effort data would be of roughly equal difficulty to collect. But we have found that organizations are more reluctant to provide quality data than they are to provide effort data, and in some cases organizations are concerned about providing quality and productivity data together. This made the data-collection process for COQUALMO quite difficult.

Table 4.23: DI Sub-Model to DR Sub-Model Rosetta-Stone

                                            RELY
PMAT                                  VL    L     N     H     VH
VL    Automated Analysis              VL    L     L     N     N
      Peer Reviews                    VL    L     L     N     H
      Execution Testing & Tools       VL    L     L     N     H
L     Automated Analysis              L     L     N     H     H
      Peer Reviews                    VL    L     N     H     H
      Execution Testing & Tools       L     L     N     N     H
N     Automated Analysis              N     N     N     H     VH
      Peer Reviews                    L     L     N     H     VH
      Execution Testing & Tools       L     N     N     H     VH
H     Automated Analysis              H     H     H     VH    VH
      Peer Reviews                    N     H     VH    XH    XH
      Execution Testing & Tools       N     H     H     VH    XH
VH    Automated Analysis              H     H     H     VH    VH
      Peer Reviews                    H     H     VH    XH    XH
      Execution Testing & Tools       H     VH    VH    VH    XH
XH    Automated Analysis              H     H     VH    VH    XH
      Peer Reviews                    VH    VH    XH    XH    XH
      Execution Testing & Tools       VH    VH    VH    XH    XH

(Each PMAT level in the table also has an associated cumulative DRF.)
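The rosetta-stone is essentially a lookup table, and the sketch below shows one way it could be applied programmatically. The rating strings are transcribed from Table 4.23; the data structure and function names are illustrative and not part of COQUALMO itself.

# Sketch of a lookup over the rosetta-stone of Table 4.23.

RELY_LEVELS = ["VL", "L", "N", "H", "VH"]   # table columns

ROSETTA = {  # ROSETTA[PMAT][profile] -> DR ratings indexed by RELY level
    "VL": {"Automated Analysis":        ["VL", "L", "L", "N", "N"],
           "Peer Reviews":              ["VL", "L", "L", "N", "H"],
           "Execution Testing & Tools": ["VL", "L", "L", "N", "H"]},
    "L":  {"Automated Analysis":        ["L", "L", "N", "H", "H"],
           "Peer Reviews":              ["VL", "L", "N", "H", "H"],
           "Execution Testing & Tools": ["L", "L", "N", "N", "H"]},
    "N":  {"Automated Analysis":        ["N", "N", "N", "H", "VH"],
           "Peer Reviews":              ["L", "L", "N", "H", "VH"],
           "Execution Testing & Tools": ["L", "N", "N", "H", "VH"]},
    "H":  {"Automated Analysis":        ["H", "H", "H", "VH", "VH"],
           "Peer Reviews":              ["N", "H", "VH", "XH", "XH"],
           "Execution Testing & Tools": ["N", "H", "H", "VH", "XH"]},
    "VH": {"Automated Analysis":        ["H", "H", "H", "VH", "VH"],
           "Peer Reviews":              ["H", "H", "VH", "XH", "XH"],
           "Execution Testing & Tools": ["H", "VH", "VH", "VH", "XH"]},
    "XH": {"Automated Analysis":        ["H", "H", "VH", "VH", "XH"],
           "Peer Reviews":              ["VH", "VH", "XH", "XH", "XH"],
           "Execution Testing & Tools": ["VH", "VH", "VH", "XH", "XH"]},
}

def dr_profile_ratings(pmat, rely):
    """Map a (PMAT, RELY) pair to ratings for the three DR profiles."""
    col = RELY_LEVELS.index(rely)
    return {profile: levels[col] for profile, levels in ROSETTA[pmat].items()}

print(dr_profile_ratings("N", "N"))    # all three profiles Nominal (about 74% of defects removed)
print(dr_profile_ratings("VH", "N"))   # about 92% of introduced defects removed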

In an attempt to validate that the expert-determined COQUALMO has the right trends in defect rates, I did a study using results from a TRW project reported in [Thayer, 1978]. Project 3 was selected from the five projects reported because of the completeness of its data and because I had access to further information on the project. For the purposes of this study, I call it Project A. Table 4.24 provides some of its characteristics.

Table 4.24: Project A Characteristics
Size: 110.5 kSLOC
Number of Requirements: 188
Language: JOVIAL J4
Formal Testing (in order of occurrence): Validation, Acceptance, Integration, Operational Demonstration
DI-driver ratings (each rating has associated DI-driver values for requirements, design and code defects):
RELY H, DATA N, RUSE VH, DOCU N, CPLX N, TIME H, STOR H, PVOL N, ACAP H, PCAP H, PCON H, AEXP VH, PEXP H, LTEX N, TOOL VL, SITE H, SCED VL, PREC L, RESL N, TEAM H, PMAT N
Quality Adjustment Factor (QAF) = product of the DI-driver values
DI_req (No. of Reqts Defects Introduced): 209
DI_des (No. of Design Defects Introduced): 1043
DI_cod (No. of Code Defects Introduced)

About 38% of the total number of defects are coding defects, as published in [Thayer, 1978]. From the above table, the DIR (Defect Introduction Rate) for each artifact j (where j refers to requirements, design and code defects) is the ratio of DI_j to Size. Hence DIR_req = 2, DIR_des = 10 and DIR_cod = 6. Using the Delphi results (Appendix A and [Chulani, 1997C]) to obtain the DI-driver values for the ratings provided in Table 4.24, the Quality Adjustment Factor (QAF) for each of the artifacts is computed, resulting in QAF_req = 0.74, QAF_des = 0.74 and a corresponding QAF_cod. This causes the baseline DIRs to be 2, 13 and 7 for requirements, design and code defects. These results are tabulated in Table 4.25. Comparing Project A with the 1970's projects, we notice that Project A has approximately half the baseline DIRs. This could be due to undercounting, since only defects caught during testing are included in the Project A data, whereas the 1970s project data includes defects counted from the beginning of the development life-cycle. An interesting observation is that the relative magnitudes of the baseline DIRs inferred from the Project A data are quite similar to the relative magnitudes of the 1970's baseline DIRs.

Table 4.25: Project A Defect Introduction Rates
[Columns: Type of Artifact (Reqts, Design, Code), DI, DIR, Quality Adjustment Factor (QAF_j), Baseline DIR, and 1970's Baseline DIR.]

The next step is to compare the actual residual defect density of Project A with the residual defect density estimated by the Defect Removal Model of COQUALMO using the above DIRs. Using the data presented in [Thayer, 1978] and mapping it to the life-cycle activities covered by COCOMO II, the residual number of post-acceptance defects is determined as 674. Hence, the actual residual defect density for Project A is 674/110.5 = 6 defects/kSLOC. Table 4.26 provides information on the three defect removal profiles used to model COQUALMO's Defect Removal Model. Using the Delphi results for the Defect Removal Fractions and interpolating between the rating scales for Peer Reviews and for Execution Testing and Tools, we get the DRFs shown in Table 4.26. Table 4.27 illustrates how the Defect Removal Model computes the estimated residual defect density using the defect removal profile ratings from Table 4.26.

Table 4.26: Project A Defect Removal Profiles
Automated Analysis - Associated rating: Very Low (simple compiler syntax checking). DRF: Reqts 0, Design 0, Code 0.
Peer Reviews - Associated rating: in between Very Low and Low (ad-hoc informal reviews done). DRF: Reqts 0.13, Design 0.14, Code 0.15.
Execution Testing and Tools - Associated rating: in between High and Very High (exhaustive testing done on all the artifacts, but without the use of advanced test tools and metrics-based test process management). DRF: Reqts 0.54, Design 0.61, Code 0.74.

Table 4.27: Project A Residual Defect Density
[For each artifact type (Reqts, Design, Code): the (1 - DRF) values for Automated Analysis, Peer Reviews, and Execution Testing and Tools, their product, the DI/kSLOC, and the resulting DRes/kSLOC.]

From Table 4.27, we get a total estimated residual defect density of 5.57 defects/kSLOC. This is very close to the actual defect density of 6 defects/kSLOC.
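The arithmetic behind Table 4.27 can be reproduced directly from the numbers quoted above. The sketch below uses the rounded DIRs from the text (2, 10 and 6 defects/kSLOC) and the DRFs of Table 4.26; the small difference from the 5.57 reported in Table 4.27 comes from rounding the DIRs.

# Reproducing the Project A residual-defect-density estimate from Tables 4.25-4.26,
# assuming DRes/kSLOC = DIR * prod_i (1 - DRF_i) for each artifact type.

dirs = {"requirements": 2.0, "design": 10.0, "code": 6.0}   # rounded DIRs from the text

drf = {  # Table 4.26: (Automated Analysis, Peer Reviews, Execution Testing and Tools)
    "requirements": (0.00, 0.13, 0.54),
    "design":       (0.00, 0.14, 0.61),
    "code":         (0.00, 0.15, 0.74),
}

dres = {a: dirs[a] * (1 - drf[a][0]) * (1 - drf[a][1]) * (1 - drf[a][2]) for a in dirs}
print({a: round(v, 2) for a, v in dres.items()})
print("total:", round(sum(dres.values()), 2))   # about 5.5 defects/kSLOC, versus an actual of about 6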

It would be nice if we could break down the actual residual defect density of 6 defects/kSLOC by source and compare it to the DRes/kSLOC column in Table 4.27. But data to enable a defect distribution by artifact was not available, and hence such a comparison could not be done. This concludes the independent study, which validates two important findings: (i) the trends in defect introduction rates determined from Project A's post-unit-test defect rates and COQUALMO's DI-drivers are very similar to those published in [Boehm, 1981]; and (ii) the actual post-acceptance defect density on Project A is estimated within 7% by the Project A DI rates and COQUALMO's DR fractions.

4.3.5 COQUALMO Integrated with COCOMO II

The DI and DR sub-models described above can be integrated with the existing COCOMO II cost, effort and schedule estimation model as shown in Figure 4.14. The dotted lines in Figure 4.14 are the inputs and outputs of COCOMO II. In addition to the sizing estimate and the platform, project, product and personnel attributes, COQUALMO requires the defect removal profile levels as input to predict defect density.

Figure 4.14: COQUALMO Integrated with COCOMO II
[Inputs: software size estimate; software platform, project, product and personnel attributes; defect removal profile levels. COCOMO II produces the software development effort, cost and schedule estimate. COQUALMO's Defect Introduction Model and Defect Removal Model produce the number of residual defects and the defect density per unit of size.]
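The data flow of Figure 4.14 can be summarized in a short sketch. The estimator bodies are reduced to stubs here (the calibrated COCOMO II and COQUALMO equations are not reproduced), and all function names are illustrative rather than part of either model's definition.

# Illustrative sketch of the COCOMO II / COQUALMO integration shown in Figure 4.14.
# Only the data flow between the two models is shown; the stubs stand in for the
# calibrated estimation equations.

def cocomo_ii_estimate(size_ksloc, attributes):
    """Stub: COCOMO II development effort, cost and schedule estimate."""
    raise NotImplementedError

def defect_introduction(size_ksloc, attributes):
    """Stub: DI sub-model, defects introduced per artifact type."""
    raise NotImplementedError

def defect_removal(defects_introduced, dr_profile_levels, size_ksloc):
    """Stub: DR sub-model, residual defects and defect density per unit of size."""
    raise NotImplementedError

def integrated_estimate(size_ksloc, attributes, dr_profile_levels):
    effort_cost_schedule = cocomo_ii_estimate(size_ksloc, attributes)
    introduced = defect_introduction(size_ksloc, attributes)
    residual, density = defect_removal(introduced, dr_profile_levels, size_ksloc)
    return effort_cost_schedule, residual, density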

4.3.6 Conclusions on COQUALMO

Section 4.3 described the expert-determined Defect Introduction and Defect Removal sub-models that compose the quality model extension to COCOMO II, namely COQUALMO. As discussed in chapter 3, COQUALMO is based on the tank-and-pipe model, where defects are introduced through several defect source pipes (described by the Defect Introduction model) and removed through several defect elimination pipes (modeled by the Defect Removal model). This section discussed the Delphi approach used to calibrate the initial version of the model to expert opinion. The expert-calibrated COQUALMO, when used on a project with nominal characteristics (or average ratings), predicts that approximately 14 defects per kSLOC remain. An independent study done on a TRW project verified the trends in defect rates modeled by COQUALMO.

When more data on actual completed projects is available, the model can be refined using the Bayesian approach. This statistical approach has been successfully used to calibrate COCOMO II.1999 to 161 projects, as discussed in section 4.2. The Bayesian approach can be used on COQUALMO to merge expert opinion and project data, weighted by the variances of the two sources of information, to determine a more robust posterior model. In the meantime, the model described in this chapter can be used as is, or it can be locally calibrated to a particular organization, to predict the cost, schedule and residual defect density of the software under development. Extensive sensitivity analyses can also be done to understand the interactions between these parameters for tradeoffs, risk analysis and return-on-investment studies.
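As an illustration of the variance-based merging referred to above, the sketch below shows the univariate normal case, in which the posterior precision-weights the expert-determined prior and the data-determined estimate. This illustrates the idea only; the COCOMO II.1999 calibration applies the multivariate analogue, and the numbers used here are made up.

# Univariate sketch of combining an expert prior with sampling data: the source
# with the smaller variance (higher precision) gets the larger weight.

def bayesian_combine(prior_mean, prior_var, sample_mean, sample_var):
    w_prior, w_sample = 1.0 / prior_var, 1.0 / sample_var   # precisions
    post_var = 1.0 / (w_prior + w_sample)
    post_mean = post_var * (w_prior * prior_mean + w_sample * sample_mean)
    return post_mean, post_var

# A tight expert prior combined with a noisy data estimate: the posterior mean
# stays close to the prior.
print(bayesian_combine(prior_mean=1.10, prior_var=0.01, sample_mean=0.90, sample_var=0.09))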

CHAPTER 5: Summary of Contributions and Future Research Directions

5.1 Introduction

Chapter 5 summarizes the contributions of this dissertation work to the software engineering community. It also provides a few suggestions for future research directions.

5.2 Summary of Contributions

1. The Bayesian approach discussed in chapter 4 addresses one of the biggest problems faced by the software engineering community: the challenge of making good decisions using data that is usually scarce and incomplete. Most current empirical software engineering cost models are calibrated using some form of the multiple regression approach. Chapter 4 presented the problems associated with software engineering data and showed how the Bayesian approach can be used to resolve them, thereby increasing the accuracy of the COCOMO II model calibrated to 161 datapoints. When the Bayesian model calibrated using a dataset of 83 projects is validated on a dataset of 161 projects, it yields a prediction accuracy of PRED(.30) = 66% (i.e. 106, or 66%, of the 161 datapoints are estimated within 30% of the actuals); a sketch of the PRED computation follows this list. In contrast, the pure regression-based model calibrated using the same 83 datapoints, when validated on the same 161-project dataset, yields a poorer accuracy of PRED(.30) = 44%.

2. A comprehensive framework for employing the Bayesian approach to develop robust software engineering models was presented in chapter 3. This approach has worked successfully on COCOMO II and is being employed for COQUALMO and other extensions of COCOMO II. The framework provides a logically consistent and formal way of making use of experience-based expert judgment data along with sampling information in the decision-making process. In many models, such prior information is used only informally, to evaluate the "appropriateness" of results. An important aspect of formalizing the use of prior information is that when others know what prior production functions are being used, they can repeat the calibration calculations (or can incorporate different prior information in a similar way).

3. A quality model extension to COCOMO II, COQUALMO, has been developed. Due to the lack of sufficient datapoints to empirically calibrate COQUALMO, an expert-determined a priori model has been developed. Once enough sampling information is available, the model can be calibrated by merging the prior and sampling sources of information using the Bayesian approach. In the meantime, the model described in chapter 4 can be used as is, or it can be locally calibrated to a particular organization, to predict the cost, schedule and residual defect density of the software under development. Extensive sensitivity analyses can also be done to understand the interactions between these parameters for tradeoffs, risk analysis and return-on-investment studies.
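The sketch promised under contribution 1 above shows how PRED(.30) is computed: the fraction of projects whose estimate falls within 30% of the actual. The sample numbers are made up for illustration.

def pred(actuals, estimates, level=0.30):
    """Fraction of estimates within `level` (e.g. 30%) of the corresponding actuals."""
    within = sum(1 for a, e in zip(actuals, estimates) if abs(e - a) <= level * a)
    return within / len(actuals)

# Hypothetical effort data (person-months): 3 of the 4 estimates are within 30%.
actuals = [100.0, 250.0, 80.0, 400.0]
estimates = [118.0, 190.0, 85.0, 610.0]
print(f"PRED(.30) = {pred(actuals, estimates):.0%}")   # prints PRED(.30) = 75%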

5.3 Future Research Directions

1. COCOMO II has twenty-two parameters, and this makes it difficult to statistically calibrate the model without running into problems of overfitting, although the counterargument is that COCOMO II is a comprehensive cost model explaining the entire development life-cycle. Consider our data collection process: we faced many problems with the parameter Develop for Reuse (RUSE), giving strong indications to drop the variable. But given other sources of develop-for-reuse data such as [Poulin, 1997], deleting the variable and telling the users of the COCOMO II model that developing for reuse has no impact on current development effort is inadvisable. In practice, removing a predictor variable is equivalent to stipulating that variations in this variable have no effect on project effort. When the experts in the field and the detailed behavioral analyses suggest otherwise, extremely strong evidence is needed to drop a variable. Hence, further research needs to be done to resolve this issue, either by collecting more data or by understanding whether fewer than twenty-two parameters can be used to develop a more parsimonious model without sacrificing its coverage.

2. COCOMO II has faced the problem of measurement error (popularly known as the errors-in-variables problem). This is a violation of an assumption made by the multiple regression approach (COCOMO II.1997 calibration) and the Bayesian approach (COCOMO II.1999 calibration). But the current COCOMO II data doesn't lend itself to addressing this problem. Hence, for the purposes of this dissertation, the assumption has been made that the random variation in the responses for the parameters, in particular RUSE, is small compared to the range of the parameter. It should be noted that other cost models that use the multiple regression approach rarely state this assumption explicitly, even though it is implicitly made. There is, however, a substantial literature on research done in other domains that attempts to reduce the effects of measurement error [Fuller, 1987]. A study of how these problems can be resolved to improve the accuracy of the COCOMO II model is highly recommended.

3. The prior information for the Bayesian framework is obtained by sampling experts in the field of study and getting their opinions on the impact of the parameters on the quantities to be estimated. It would be interesting to study how the prior information ages and how it should be used for future calibrations when updated sampling data is available, i.e. should the experts be re-sampled, or is there a data-driven way of reasonably updating the prior information?

4. The current calibration of COCOMO II is limited to effort for the entire development life cycle. A future research activity could involve collecting data on effort spent in different activities and studying the distribution of effort by activity.

5. The dissertation presented an improvement in accuracy when the data was stratified based on the source. A similar study in which the data is stratified based on language and/or domain would be very interesting.

6. A dynamic model enabling critical-path analysis and updating would be valuable. Such a model could be used to update estimates as more data becomes available while development progresses.

7. For the expert-determined COQUALMO presented in this dissertation, the first obvious research activity is to collect sampling information and calibrate the model to actual completed projects to validate the model structure. Our initial assumption was that quality data and effort data would be of roughly equal difficulty to collect, but we have found that organizations are even more reluctant to provide quality data than to provide effort data.

8. COQUALMO estimates the defect density of non-trivial defects but, for simplicity, doesn't attempt to categorize the results any further based on severity. This is an important research activity for the downstream success of COQUALMO.

9. COQUALMO does not attempt to model the cost of defect removal over time. This is difficult to model in the absence of quality data and hence should be considered when enough data is available.

10. A study of the impact of COTS integration on effort is currently underway with the COCOTS effort (COCOTS is one of the extensions of COCOMO II). A similar study on the impact of COTS on quality would be highly beneficial. On similar grounds, a study of the quality of reusing code or sub-contracting code would also be very useful.

11. The current COCOMO II data doesn't show any significant differences in prediction accuracy among small, medium and large projects (see figure 4.5 in chapter 4). It would be interesting to do a similar study to see whether increasing size has an impact on residual defect density. Jones's data indicates that as size increases, defect density increases (see chapter 3 in [Jones, 1997]). Such a study can be done only after more quality data is collected.

In summary, even though a cookbook for software cost and quality estimation is hard to find, the field has made very good progress in the last couple of decades. The current state of the practice can provide excellent practical insight into the estimation arena, although researchers need to give keen attention to a few problem areas.

Bibliography

Abdel-Hamid, Software Project Dynamics, Abdel-Hamid and Stuart Madnick, Englewood Cliffs, NJ, Prentice Hall.
Bailey, A Meta-Model for Software Development Resources Expenditures, Bailey J. W. and Basili V. R., 5th International Conference on Software Engineering, IEEE Press.
Banker, An Empirical Test of Object-based Output Measurement Metrics in a Computer Aided Software Engineering (CASE) Environment, Rajiv D. Banker, Robert J. Kauffman and Rachna Kumar.
Banker, Evidence on Economies of Scale in Software Development, Rajiv D. Banker, Hsihui Chang, Chris F. Kemerer, Information and Software Technology.
Bazzana, Software Management by Metrics: Practical Experiences in Italy, G. Bazzana, P. Caliman, P. Gandini et al., Software Quality Assurance and Measurement, Eds. N. Fenton, R. Whitty, Y. Iizuka, International Computer Press, London.
Boehm, Software Engineering Economics, Barry W. Boehm, Prentice-Hall.
Boehm, Cost Models for Future Software Life Cycle Processes: COCOMO 2.0, Boehm, B., B. Clark, E. Horowitz, C. Westland, R. Madachy, R. Selby, Annals of Software Engineering Special Volume on Software Process and Product Measurement, J. D. Arthur and S. M. Henry, Eds., J. C. Baltzer AG, Science Publishers, Amsterdam, The Netherlands, Vol. 1.
Briand, A Pattern Recognition Approach for Software Engineering Data Analysis, Lionel C. Briand, Victor R. Basili and William M. Thomas, IEEE Transactions on Software Engineering, Vol. 18, No. 11, November.
Box, Bayesian Inference in Statistical Analysis, George Box and George Tiao, Addison Wesley.
Chidamber, A Metrics Suite for Object Oriented Design, Shyam Chidamber and Chris Kemerer, CISR WP No. 249 and Sloan WP No., Center for Information Systems Research, Sloan School of Management, Massachusetts Institute of Technology.
Chulani, 1997A - Modeling Software Defect Introduction, Sunita Devnani-Chulani, California Software Symposium, November.
Chulani, 1997B - Calibration Results of COCOMO II.1997, Sunita Devnani-Chulani, Brad Clark, Barry Boehm, 22nd Software Engineering Workshop, NASA-Goddard, December.
Chulani, 1997C - Results of Delphi for the Defect Introduction Model, Sunita Devnani-Chulani, USC-CSE.
Chulani, Calibration Approach and Results of the COCOMO II Post Architecture Model, Sunita Chulani, Brad Clark, Barry Boehm and Bert Steece, 20th Annual Conference of the International Society of Parametric Analysts (ISPA) and the 8th Annual Conference of the Society of Cost Estimating and Analysis (SCEA), June 98.
Conte, Software Engineering Metrics and Models, S. D. Conte, Benjamin Cummings, Menlo Park, CA.
Cook, An Introduction to Regression Graphics, Dennis Cook and Sanford Weisberg, Wiley Series.
CSE, Center for Software Engineering, COCOMO II Model Definition Manual, Computer Science Department, USC Center for Software Engineering.
Cuelenaere, Calibrating software cost estimation model: why and how, A. M. Cuelenaere, van Genuchten and F. J. Heemstra, Information and Software Technology, 29 (10).
Daskalantonakis, A Practical View of Software Measurement and Implementation Experiences within Motorola, M. Daskalantonakis, IEEE Transactions on Software Engineering, Vol. 18, No. 11, November.
Dyer, The Cleanroom Approach to Quality Software Development, Michael Dyer, Wiley Series in Software Engineering Practice.
Farr, A Survey of Software Reliability Modeling and Estimation, NSWC TR-171, W. H. Farr, Naval Surface Warfare Center.
Farquhar, A Preliminary Inquiry Into the Software Estimation Process, J. A. Farquhar, RM-6271-PR, The Rand Corporation.
Fenton, Software Metrics: A Rigorous Approach, N. E. Fenton, Chapman and Hall, London.
Fenton, Software Metrics: A Rigorous and Practical Approach, N. Fenton and S. Pfleeger, ITP, London.
Forrester, Industrial Dynamics, Cambridge, MA, Forrester J. W., MIT Press.
Forrester, Principles of Systems, Cambridge, MA, Forrester J. W., MIT Press.
Frieman, PRICE Software Model, Version 3, An Overview, Freiman F. R. and Park R. D., Proceedings of IEEE-PINY Workshop on Quantitative Software Models for Reliability.
Fuller, Measurement Error Models, Wayne A. Fuller, Wiley Series in Probability and Mathematical Statistics.
Gaffney, An Approach to Estimating Software Errors and Availability, J. E. Gaffney and C. F. Davis, Proceedings of the 11th Minnowbrook on Software Reliability, July.
Gaffney, An Automated Model for Software Early Error Prediction (SWEEP), Proceedings of the 13th Minnowbrook on Software Reliability, July.
Gelman, Bayesian Data Analysis, Andrew Gelman, John Garlin, Hal Stern, Donald Rubin, Chapman Hall.
Grady, Software Metrics: Establishing a Company-wide Program, Grady B. and Caswell D., Prentice Hall.
Gray, A Comparison of Techniques for Developing Predictive Models for Software Metrics, Andrew R. Gray and Stephen G. MacDonnell, Information and Software Technology 39.
Goel, Time-Dependent Error-Detection Rate Model for Software and Other Performance Measures, IEEE Transactions on Reliability, vol. R-28, no. 3, August.
Gulledge, Analytical Methods in Software Engineering Economics, Thomas R. Gulledge and William P. Hutzler, Springer-Verlag.
Helmer, Social Technology, O. Helmer, Basic Books, NY.
Henderson, Object Oriented Metrics - Measures of Complexity, Henderson-Sellers, B., Prentice Hall, Upper Saddle River, NJ.
IFPUG, International Function Point Users Group (IFPUG), Function Point Counting Practices Manual, Release 4.0.
Jeffery, Calibrating estimation tools for software development, D. R. Jeffery and G. C. Low, Software Engineering Journal, 5 (4).
Jensen, A Comparison of the Jensen and COCOMO Schedule and Cost Estimation Models, Jensen R. W., Proceedings of the International Society of Parametric Analysts, April.
Johnson, Expertise and decision under uncertainty: Performance and Process, E. J. Johnson, The Nature of Expertise, Editors Chi, Glaser, Farr, Lawrence Earlbaum Associates.
Jones, Programming Defect Removal, Capers Jones, Proceedings, GUIDE 40.
Jones, Applied Software Measurement, Capers Jones, McGraw Hill.
Jones, The Impact of Poor Quality and Canceled Projects on the Software Labor Shortage, Capers Jones, Technical Report, Software Productivity Research, Inc. (an Artemis company).
Judge, The Theory and Practice of Econometrics, George G. Judge, W. E. Griffiths, R. Carter Hill, Helmut Lutkepohl, Tsoung-Chao Lee, Wiley.
Judge, Learning and Practicing Econometrics, George G. Judge, William Griffiths, R. Carter Hill, Wiley.
Kan, Metrics and Models in Software Quality Engineering, Stephen H. Kan, Addison-Wesley.
Karunanithi, Using Neural Networks in Reliability Prediction, IEEE Software, vol. 9, no. 4, July.
Kauffman, Modeling Estimation Expertise in Object Based ICASE Environments, Kauffman, R., and R. Kumar, Stern School of Business Report, New York University, January.
Kemerer, An Empirical Validation of Software Cost Estimation Models, Chris F. Kemerer, Communications of the ACM, Volume 30, Number 5, May.
Khoshgoftaar, Application of Neural Networks for predicting program faults, T. M. Khoshgoftaar, A. S. Pandya, D. L. Lanning, Annals of Software Engineering, Vol. 1.
Kitchenham, Software Cost Models, B. A. Kitchenham and N. R. Taylor, ICL Technical Journal, May.
Leamer, Specification Searches, Ad hoc Inference with Nonexperimental Data, Edward E. Leamer, Wiley Series.
Lin, Software Engineering Process Simulation Model, Lin C., Abdel-Hamid T., Sherif J., TDA Progress Report, JPL, Feb.
Littlewood, A Bayesian Reliability Growth Model for Computer Software, B. Littlewood and J. Verrall, Journal of the Royal Statistical Society, series C, vol. 22, no. 3.
Littlewood, Validation of a Software Model, B. Littlewood, Software Life Cycle Management Workshop, Atlanta, Georgia, International Services Business, Inc.
Littlewood, Conceptual Modeling of Coincidental Failures in Multiversion Software, B. Littlewood, IEEE Transactions on Software Engineering, vol. 15, no. 12.
Lyu, Handbook of Software Reliability Engineering, Michael R. Lyu, IEEE Computer Society Press.
Madachy, A Software Project Dynamics Model for Process Cost, Schedule and Risk Assessment, Raymond Madachy, Ph.D. dissertation, USC.
Madachy, Heuristic Risk Assessment Using Cost Factors, Raymond Madachy, IEEE Software, May/June.
Masters, An overview of software cost estimating at the National Security Agency, T. F. Masters, Journal of Parametrics, 5 (1).
Mohanty, Software Cost Estimation: Present and Future, S. N. Mohanty, Software Practice and Experience, 11, 1981.
Minkiewicz, Measuring Object Oriented Software with Predictive Object Points, Arlene Minkiewicz, PRICE Systems.
Moranda, Final Report on Software Reliability Study, P. L. Moranda and Z. Jelinski, McDonnell Douglas Astronautics Company, MADC Report No. 63921.
Mullet, Why regression coefficients have the wrong sign, G. M. Mullet, Journal of Quality Technology.
Musa, A Logarithmic Poisson Execution Time Model for Software Reliability Measurement, J. D. Musa and K. Okumoto, Proceedings of the Seventh International Conference on Software Engineering, Orlando, Florida, 1984.
Nelson, Management Handbook for the Estimation of Computer Programming Costs, E. A. Nelson, AD-A648750, Systems Development Corp., Oct.
Park, The Central Questions of the PRICE Software Cost Model, Park R., 4th COCOMO Users Group Meeting, November.
Park, Software Size Measurement: A Framework for Counting Source Statements, Park, CMU-SEI-92-TR-20, Software Engineering Institute, Pittsburgh, PA.
Pfleeger, The Economics of Reuse: New Approaches to Modeling Cost, Pfleeger S. and Bollinger T., Information and Software Technology, 32(10), December.
Poulin, Measuring Software Reuse, Principles, Practices and Economic Models, Jeffrey S. Poulin, Addison Wesley.
Putnam, Measures for Excellence, Lawrence H. Putnam and Ware Myers, Yourdon Press Computing Series.
Reifer, Softcost-R, Costar User's Manual, D. Reifer, RCI.
ReiferA, Softcost-Ada, Costar User's Manual, D. Reifer, RCI.
ReiferB, Softcost-OO, Costar User's Manual, D. Reifer, RCI.
Reifer, Practical Software Reuse, D. Reifer, John Wiley and Sons, New York.
Reifer, The Rosetta Stone: Making COCOMO Estimates Work With COCOMO II, D. Reifer, B. Boehm, S. Chulani, Crosstalk, The Journal of Defense Engineering, February.
Remus, Prediction and Management of Software Quality, H. Remus and S. Zilles, Proceedings of the 4th International Conference on Software Engineering, IEEE Computer Society Press, New York.
Richardson, System Dynamics: Simulation Modeling and Analysis, Richardson G. P., Fishwich and Luker, eds., Springer-Verlag.
Rubin, Macroestimation of Software Development Parameters: the Estimacs System, Rubin H. A., in SOFTFAIR Conference on Development Tools, Techniques and Alternatives, Arlington, IEEE Press, New York, July.
Shepperd, Estimating Software Project Effort Using Analogies, M. Shepperd and C. Schofield, IEEE Transactions on Software Engineering, Vol. 23, No. 11, November.
SEI, Process Maturity Profile of the Software Community: 1998 Year End Update, Carnegie Mellon University, Software Engineering Institute's Technical Report, March.
Tausworthe, Deep Space Network Software Cost Estimation Model, Tausworthe R. C., Jet Propulsion Laboratory Publication 81-7, Pasadena, CA.
Thayer, Software Reliability, Thomas Thayer, Myron Lipow, Eldred Nelson, TRW Series of Software Technology 2.
Trachtenburg, Discovering how to ensure Software Reliability, Trachtenburg M., RCA Engineer, Jan/Feb.
Vicinanza, Software Effort Estimation: An exploratory study of expert performance, S. Vicinanza, T. Mukhopadhyay, M. Prietula, Information Systems 2, 1991.
Weisberg, Applied Linear Regression, Weisberg, S., 2nd Ed., John Wiley and Sons, New York, N.Y.
Wittig, Using Artificial Neural Networks and Function Points to Estimate 4GL Software Development Effort, Wittig G. E. and Finnie G. R., Australian Journal of Information Systems.
Wohlwend, Schlumberger's Software Improvement Program, Wohlwend H. and Rosenbaum S., IEEE Transactions on Software Engineering, 20(11).
Zellner, Applications of Bayesian Analysis and Econometrics, The Statistician, Vol. 132.

APPENDIX A: COCOMO II and COQUALMO Delphi Results

Appendix A presents the results of the three Delphis conducted for COCOMO II (in section A.1) and for the two sub-models of COQUALMO, namely the Defect Introduction (in section A.2) and the Defect Removal (in section A.3) models.

A.1 COCOMO II Delphi Results
[Table: for each parameter (PREC, FLEX, RESL, TEAM, PMAT, RELY, DATA, RUSE, DOCU, CPLX, TIME, STOR, PVOL, ACAP, PCAP, PCON, AEXP, PEXP, LTEX, TOOL, SITE, SCED), the initial productivity range and the median, minimum and maximum values from Round 1, Round 2 and Round 3 (Round 3 only for the five scale factors).]

156 A.2 Delphi Results of Defect Introduction Sub-Model of COQUALMO SCALE FACTORS Precedentedness (PREC) PREC level Requirements Design Code XH Fewer Requirements defects due to less learning and fewer false starts Fewer Requirements understanding defects Fewer Requirements defects since very little concurrent development of associated new hardware or operational procedures Fewer Design defects due to less learning and fewer false starts Fewer Requirements traceability defects Fewer Design defects since very little concurrent development of associated new hardware or operational procedures Fewer Design defects since minimal need for innovative data processing architectures, algorithms Fewer defects introduced in fixing requirements, preliminary design fixes 0.70 (VH = 0.84) 0.75 (VH = 0.87) Nominal Nominal level of defect introduction 1.0 VL More Requirements defects due to more learning and more false starts More Requirements understanding defects More Requirements defects since extensive concurrent development of associated new hardware or operational procedures More Design defects due to more learning and more false starts More Requirements traceability defects More Design defects since extensive concurrent development of associated new hardware or operational procedures More Design defects since considerable need for innovative data processing architectures, algorithms More defects introduced in fixing requirements, preliminary design fixes 1.34 Fewer Coding defects due to less learning Fewer Coding defects due to requirements, design shortfalls Fewer Coding defects since very little concurrent development of associated new hardware or operational procedures Fewer Coding defects since minimal need for innovative data processing architectures, algorithms 0.81 (VH = 0.90) More Coding defects due to more learning More Coding defects due to requirements, design shortfalls More Coding defects since extensive concurrent development of associated new hardware or operational procedures More Coding defects since considerable need for innovative data processing architectures, algorithms Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

157 Development Flexibility (FLEX) FLEX level Requirements Design Code XH Slightly more defects due to more in-process requirements changes Slightly fewer defects due to ability to relax tight schedule 1.0 Slightly more defects due to more in-process requirements changes Slightly fewer defects due to ability to relax tight schedule 1.0 Slightly more defects due to more in-process requirements changes Slightly fewer defects due to ability to relax tight schedule 1.0 Nominal Nominal level of defect introduction VL Slightly fewer defects due to more in-process requirements changes Slightly more defects due to ability to relax tight schedule Slightly fewer defects due to more in-process requirements changes Slightly more defects due to ability to relax tight schedule 1.0 Slightly fewer defects due to more in-process requirements changes Slightly more defects due to ability to relax tight schedule 1.0 Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

158 Architecture/Risk Resolution (RESL) RESL level Requirements Design Code XH Fewer Requirements defects due to - fewer number of highly critical risk items - very little uncertainty in key architecture drivers : mission user interface, COTS, hardware, technology, performance - thorough planning, specs, reviews and validation 0.76 (VH = 0.87) Fewer Design defects due to - fewer number of highly critical risk items - very little uncertainty in key architecture drivers : mission user interface, COTS, hardware, technology, performance - thorough planning, specs. reviews and validation 0.70 (VH = 0.84) Fewer Coding defects due to - fewer number of highly critical risk items - very little uncertainty in key architecture drivers : mission user interface, COTS, hardware, technology, performance - fewer misunderstandings in interpreting incomplete or ambiguous specs 0.71 (VH = 0.84) Nominal Nominal level of defect introduction VL More Requirements defects due to - higher number of highly critical risk items - high uncertainty in key architecture drivers : mission user interface, COTS, hardware, technology, performance - minimal planning, specs, reviews and validation 1.0 More Design defects due to - higher number of highly critical risk items - high uncertainty in key architecture drivers : mission user interface, COTS, hardware, technology, performance - minimal planning, specs, reviews and validation More Coding defects due to - higher number of highly critical risk items - high uncertainty in key architecture drivers : mission user interface, COTS, hardware, technology, performance - minimal planning, specs, reviews and validation - more misunderstandings in interpreting incomplete or ambiguous specs Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

159 Team Cohesion (TEAM) TEAM level Requirements Design Code XH Fewer Requirements defects due to full willingness of stakeholders to accommodate other stakeholders objectives and extensive teambuilding to achieve shared vision and commitments Fewer Requirements understanding, completeness and consistency defects due to shared context, good communication support and extensive experience of stakeholders operating as a team 0.75 (VH = 0.87) Fewer Requirements traceability defects, Design completeness, consistency defects due to shared context, good communication support and extensive experience of stakeholders operating as a team Fewer Design defects due to full willingness of stakeholders to accommodate other stakeholders objectives and extensive teambuilding to achieve shared vision and commitments Fewer defects introduced in fixing defects 0.80 (VH = 0.90) Fewer Coding defects due to requirements, design shortfalls and misunderstandings, shared coding context, good communication support and extensive experience of stakeholders operating as a team Fewer Coding defects due to extensive teambuilding to achieve shared vision and commitments Fewer defects introduced in fixing defects 0.85 (VH = 0.92) Nominal Nominal level of defect introduction 1.0 VL More Requirements defects due to little willingness of stakeholders to accommodate other stakeholders objectives and lack of teambuilding to achieve shared vision and commitments More Requirements understanding, completeness and consistency defects due to lack of shared context, poor communication support and lack of experience of stakeholders operating as a team 1.34 More Requirements traceability defects, Design completeness, consistency defects due to lack of shared context, poor communication support and lack of experience of stakeholders operating as a team More Design defects due to lack of willingness of stakeholders to accommodate other stakeholders objectives and lack of teambuilding to achieve shared vision and commitments More defects introduced in fixing defects More Coding defects due to requirements, design shortfalls and misunderstandings, lack of shared coding context, poor communication support and lack of experience of stakeholders operating as a team More Coding defects due to lack of teambuilding to achieve shared vision and commitments More defects introduced in fixing defects 1.18 Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

160 Process Maturity (PMAT) PMAT level Requirements Design Code XH Fewer Requirements defects due to - better requirements management - better training resulting in fewer false starts - defect prevention activities - technology and process improvements - better peer reviews, QA, CM 0.73 (VH = 0.85) Fewer Design defects due to - better requirements management - better training - defect prevention activities - technology and process improvements - better peer reviews, QA, CM 0.61 (VH = 0.78) Fewer Coding defects due to - better training - defect prevention activities - technology and process improvements - better peer reviews, QA, CM 0.63 (VH = 0.79) Nominal Nominal level of defect introduction VL More Requirements defects due to - lack of good requirements management - lack of training resulting in more false starts - lack of defect prevention activities - lack of technology and process improvements - lack of thorough peer reviews, QA, CM More Design defects due to - lack of good requirements management - lack of training - lack of defect prevention activities - lack of technology and process improvements - lack of thorough peer reviews, QA, CM More Coding defects due to - lack of training - lack of defect prevention activities - lack of technology and process improvements - lack of thorough peer reviews, QA, CM Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

161 EFFORT MULTIPLIERS Required Software Reliability (RELY) RELY level Requirements Design Code VH Fewer Requirements Completeness, consistency defects due to detailed verification, QA, CM, standards, SSR, documentation, IV&V interface, test plans, procedures 0.70 Fewer Design defects due to detailed verification, QA, CM, standards, PDR, documentation, IV&V interface, design inspections, test plans, procedures 0.69 Fewer Coding defects due to detailed verification, QA, CM, standards, documentation, IV&V interface, code inspections, test plans, procedures 0.69 Nominal Nominal level of defect introduction VL More Requirements Completeness, consistency defects due to minimal verification, QA, CM, standards, PDR, documentation, IV&V interface, test plans, procedures More Design defects due to minimal verification, QA, CM, standards, PDR, documentation, IV&V interface, design inspections, test plans, procedures 1.45 More Coding defects due to minimal verification, QA, CM, standards, PDR, documentation, IV&V interface, code inspections, test plans, procedures 1.45 Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

162 Data Base Size (DATA) DATA level Requirements Design Code VH More Requirements defects due to - complex database design and validation - complex HW/SW storage interface Nominal L 1.07 Nominal level of defect introduction Fewer Requirements defects due to - simple database design and validation - simple HW/SW storage interface More Design defects due to - complex database design and validation - complex HW/SW storage interface - more data checking in program Fewer Design defects due to - simple database design and validation - simple HW/SW storage interface - simple data checking in program 0.91 More Coding defects due to - complex database development - complex HW/SW storage interface - more data checking in program 1.10 Fewer Coding defects due to - simple database development - Simple HW/SW storage interface - simple data checking in program Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

163 Required Reusability (RUSE) RUSE level Requirements Design Code XH More Requirements defects due to higher complexity in design, validation and interfaces. Fewer Requirements defects due to more thorough interface analysis 1.05 More Design defects due to higher complexity in design, validation and interfaces Fewer Requirements defects due to more thorough interface analysis 1.02 More Coding defects due to higher complexity in design, validation and interfaces Fewer Requirements defects due to more thorough interface definitions 1.02 Nominal Nominal level of defect introduction L Fewer Requirements Defects due to lower complexity in design, validation and interfaces. 1.0 Fewer Design defects due to lower complexity in design, validation, test plans and interfaces 0.98 Fewer Coding defects due to lower complexity in design, validation, test plans and interfaces Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

164 Documentation Match to Life-Cycle Needs (DOCU) DOCU level Requirements Design Code VH Fewer Requirements defects due to good quality detailed documentation of the requirements analysis 0.86 Fewer Design defects due to good quality detailed documentation of the requirements analysis and product design 0.85 Fewer Coding defects due to - good quality detailed documentation of the requirements analysis and product design - fewer defects in requirements and design 0.85 Nominal Nominal level of defect introduction VL More Requirements defects due to minimal documentation of the requirements analysis (which may not be of good quality) 1.0 More Design defects due to minimal documentation of the requirements analysis and product design (which may not be of good quality) More Coding defects due to - minimal documentation of the requirements analysis and product design (which may not be of good quality) - more defects in requirements and design Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

165 Product Complexity (CPLX) CPLX level Requirements Design Code XH More Requirements understanding defects More Requirements Defects due to - complex specification and validation - complex interfaces 1.32 More Design defects due to - complex design and validation - complex interfaces 1.41 More Coding defects due to - complex data and control structures - complex interfaces More Coding defects due to Requirements and Design shortfalls 1.41 Nominal Nominal level of defect introduction VL Fewer Requirements understanding defects Fewer Requirements Defects due to - simpler specification and validation - simpler interfaces Fewer Design defects due to - simpler design and validation - simpler interfaces Fewer Coding defects due to - simpler data and control structures - simpler interfaces Fewer Coding defects due to Requirements and Design shortfalls Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

166 Execution Time Constraint (TIME) TIME level Requirements Design Code XH More Requirements Defects due to trickier analysis, complex interface design, test plans and more planning 1.08 More Design Defects due to trickier analysis, complex interface design, test plans and more planning 1.2 More Coding Defects since code and data trickier to debug 1.2 Nominal Nominal or lower level of defect introduction 1.0 Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

167 Main Storage Constraint (STOR) STOR level Requirements Design Code XH More Requirements understanding defects due to trickier analysis 1.08 More Design defects due to trickier analysis 1.18 More Coding defects since code and data trickier to debug 1.15 Nominal Nominal or lower level of defect introduction 1.0 Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

168 Platform Volatility (PVOL) PVOL level Requirements Design Code VH More Requirements defects due to changes in platform characteristics 1.16 More Design defects due to changes in platform characteristics 1.20 More Coding defects due to Requirements and Design shortfalls More Coding defects due to changes in platform characteristics 1.22 Nominal Nominal level of defect introduction L Fewer Requirements defects due to fewer changes in platform characteristics 1.0 Fewer Design defects due to fewer changes in platform characteristics Fewer Defects introduced in fixing defects Fewer Coding defects due to Requirements and Design shortfalls Fewer Coding defects due to fewer changes in platform characteristics Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

169 Analyst Capability (ACAP) ACAP level VH Nominal VL Requirements Design Code Fewer Requirements understanding defects Fewer Requirements Completeness, consistency defects 0.75 Nominal level of defect introduction More Requirements understanding defects More Requirements Completeness, consistency defects Fewer Requirements traceability defects Fewer Design Completeness, consistency defects Fewer defects introduced in fixing defects More Requirements traceability defects More Design Completeness, consistency defects More defects introduced in fixing defects 1.20 Fewer Coding defects due to requirements, design shortfalls -missing guidelines -ambiguities 0.90 More Coding defects due to requirements, design shortfalls -missing guidelines -ambiguities Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

170 Programmer Capability (PCAP) PCAP level VH Requirements Design Code Fewer Design defects due to easy interaction with analysts Fewer defects introduced in fixing defects Fewer Coding defects due to fewer detailed design reworks, conceptual misunderstandings, coding mistakes Nominal VL Nominal level of defect introduction 1.0 More Design defects due to less easy interaction with analysts More defects introduced in fixing defects 0.76 More Coding defects due to more detailed design reworks, conceptual misunderstandings, coding mistakes Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

171 Personnel Continuity (PCON) PCON level Requirements Design Code VH Fewer Requirements understanding defects Fewer Requirements completeness, consistency defects. Fewer Requirements defects due to false starts 0.82 Fewer Requirements traceability defects Fewer Design completeness, consistency defects Fewer defects introduced in fixing defects 0.80 Fewer Coding defects due to requirements, design shortfalls, misunderstandings -missing guidelines, context -ambiguities Fewer defects introduced in fixing defects 0.77 Nominal Nominal level of defect introduction VL More Requirements understanding defects More Requirements completeness, consistency defects. More Requirements defects due to false starts More Requirements traceability defects More Design completeness, consistency defects More defects introduced in fixing defects More Coding defects due to requirements, design shortfalls, misunderstandings -missing guidelines, context -ambiguities More defects introduced in fixing defects Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

172 Applications Experience (AEXP) AEXP level Requirements Design Code VH Fewer Requirements defects due to less learning and fewer false starts Fewer Requirements understanding defects Nominal VL 0.81 Nominal level of defect introduction More Requirements defects due to extensive learning and more false starts More Requirements understanding defects Fewer Design defects due to less learning and fewer false starts Fewer Requirements traceability defects Fewer defects introduced in fixing requirements, preliminary design fixes More Design defects due to less learning and fewer false starts More Requirements traceability defects More defects introduced in fixing requirements, preliminary design fixes 1.22 Fewer Coding defects due to less learning Fewer Coding defects due to requirements, design shortfalls 0.88 More Coding defects due to extensive learning More Coding defects due to requirements, design shortfalls Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

173 Platform Experience (PEXP) PEXP level Requirements Design Code VH Fewer Requirements defects due to fewer application/platform interface analysis misunderstandings Fewer Design defects due to fewer application/platform interface design misunderstandings Fewer Coding defects due to application/platform interface coding misunderstandings Nominal VL 0.90 Nominal level of defect introduction More Requirements defects due to more application/platform * interface, database, networking analysis misunderstandings More Design defects due to more application/platform interface design misunderstandings 0.86 More Coding defects due to application/platform interface coding misunderstandings Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range Platform can include office automation, database, and user-interface support packages; distributed middleware; operating systems; networking support; and hardware. 161

174 Language and Tool Experience (LTEX) LTEX level Requirements Design Code VH Fewer Requirements defects as easier to find and fix via tools Nominal VL 0.93 Nominal level of defect introduction More Requirements defects as difficult to find and fix via tools Fewer Design defects due to fewer design versus language mismatches and because defects are easier to find and fix via tools More Design defects due to more design versus language mismatches and because defects are difficult to find and fix via tools Fewer Coding defects due to fewer language misunderstandings and because defects are easier to find and fix via tools 0.82 More Coding defects due to more language misunderstandings and because defects are more difficult to find and fix via tools Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

175 Use of Software Tools (TOOL) TOOL level Requirements Design Code VH Fewer Requirements defects as easier to find and fix via tools 0.92 Fewer Design defects as easier to fix and find via tools 0.91 Fewer Coding defects as easier to find and fix via tools Fewer defects due to automation of translation of detailed design into code 0.80 Nominal Nominal level of defect introduction VL More Requirements defects as harder to find and fix via tools 1.0 More Design defects as harder to find and fix via tools More Coding defects as harder to find and fix via tools More defects due to manual translation of detailed design into code Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

176 Multisite Development (SITE) SITE level Requirements Design Code XH Fewer Requirements understanding, completeness and consistency defects due to shared context, good communication support Nominal VL 0.83 Nominal level of defect introduction More Requirements understanding, completeness and consistency defects due to lack of shared context, poor communication support Fewer Requirements traceability defects, Design completeness, consistency defects due to shared context, good communication support Fewer defects introduced in fixing defects More Requirements traceability defects, Design completeness, consistency defects due to lack of shared context, poor communication support More defects introduced in fixing defects 1.20 Fewer Coding defects due to requirements, design shortfalls and misunderstandings; shared coding context Fewer defects introduced in fixing defects 0.85 More Coding defects due to requirements, design shortfalls and misunderstandings; lack of shared context More defects introduced in fixing defects Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

177 Required Development Schedule (SCED) SCED level Requirements Design Code VH Fewer requirements defects due to higher likelihood of correctly interpreting specs Fewer defects due to more thorough planning, specs, validation Nominal VL 0.85 Nominal level of defect introduction More Requirements defects due to - more interface problems (more people in parallel) - more TBDs in specs, plans - less time for validation Fewer Design defects due to higher likelihood of correctly interpreting specs Fewer Design defects due to fewer defects in fixes and fewer specification defects to fix Fewer defects due to more thorough planning, specs, validation More Design defects due to earlier TBDs, more interface problems, less time for V&V More defects in fixes and more specification defects to fix 1.19 Fewer Coding defects due to higher likelihood of correctly interpreting specs Fewer Coding defects due to requirements and design shortfalls Fewer defects introduced in fixing defects 0.84 More Coding defects due to requirements and design shortfalls, less time for V&V More defects introduced in fixing defects Initial Quality Range Range - Round Median - Round Range - Round Final Quality Range

A.3 Delphi Results of Defect Removal Sub-Model of COQUALMO

Automated Analysis: Round 1 and Round 2 medians and ranges (min-max) of the defect removal fractions for requirements, design, and code defects at each rating level from Very Low through Extra High. [Numeric Delphi values not recovered.]

People Reviews: Round 1 and Round 2 medians and ranges (min-max) of the defect removal fractions for requirements, design, and code defects at each rating level from Very Low through Extra High. [Numeric Delphi values not recovered.]

Execution Testing and Tools: Round 1 and Round 2 medians and ranges (min-max) of the defect removal fractions for requirements, design, and code defects at each rating level from Very Low through Extra High. [Numeric Delphi values not recovered.]
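To indicate how the defect removal fractions (DRFs) elicited above are used, the following sketch composes the three removal profiles multiplicatively to estimate residual defects per artifact. It is an illustrative sketch only: the function name, the DRF values, and the introduced-defect counts are hypothetical, not Delphi or calibration results.

# Sketch: residual defects per artifact after applying the three removal profiles
# multiplicatively: residual = introduced * product over profiles of (1 - DRF).
# All names and numbers below are hypothetical.

def residual_defects(introduced, drf_by_profile):
    residual = {}
    for artifact, count in introduced.items():
        remaining = 1.0
        for drfs in drf_by_profile.values():
            remaining *= (1.0 - drfs[artifact])   # fraction surviving this profile
        residual[artifact] = count * remaining
    return residual

introduced = {"requirements": 20, "design": 40, "code": 60}          # hypothetical counts
drf_by_profile = {                                                   # hypothetical DRFs
    "automated_analysis":      {"requirements": 0.10, "design": 0.13, "code": 0.20},
    "people_reviews":          {"requirements": 0.30, "design": 0.28, "code": 0.30},
    "execution_testing_tools": {"requirements": 0.40, "design": 0.44, "code": 0.50},
}
print(residual_defects(introduced, drf_by_profile))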

Appendix B: COCOMO II (1) and COQUALMO (2) Data Collection Questionnaire

1 Introduction

The Center for Software Engineering at the University of Southern California is conducting research to update the software development cost estimation model called COCOMO. The project name is COCOMO II and it is led by Dr. Barry W. Boehm. A fundamental requirement for such research is real-world software development project data. This data will be used to test hypotheses and verify the model's postulations. In return, the model will be open and made available to the public. The contribution of your data will ensure the final model is useful.

The data that is contributed is important to us. We will safeguard your contribution so as not to compromise company proprietary information.

Some Affiliates have an active collection program, and the data from past projects is available for the COCOMO II data collection effort. This questionnaire can be used to extract relevant COCOMO II data. A rosetta-stone that converts COCOMO 81 data to COCOMO II data is also available; please contact us if you would like to get a copy.

This questionnaire attempts to address two different levels of data granularity: project level and component level.

(1) COnstructive COst MOdeling (COCOMO) is defined in Software Engineering Economics by Barry W. Boehm, Prentice Hall, 1981.
(2) COnstructive QUALity MOdel (COQUALMO) is the quality model extension to COCOMO II. For more details refer to

The project level of granularity is data that is applicable to the whole project. This includes things like application type and development activity being reported. Component level data are things like size, cost, and component cost drivers. If the data being submitted is on a project that has multiple components, then fill out the project data once and the component data for each identifiable component. If the data being submitted is for the whole project, fill out the form once.

The data collection activity for the COCOMO II research effort started in November. The first calibration was published in 1997, based on 83 datapoints collected. It became popular as COCOMO II.1997 and produced effort estimates within 30% of the actuals 52% of the time. The second calibration was published in 1998, based on 161 datapoints. It is known as COCOMO II.1998 and produces effort estimates within 30% of the actuals 75% of the time. The aim of the COCOMO II research team is to continually update the existing COCOMO II database and to publish annual calibrations of the COCOMO II model. Hence, by submitting your data to us, you play a significant role in the model calibration.

COCOMO II Points of Contact

For questions on the COCOMO II Model and its extensions, data definitions, or project data collection and management, contact:

Sunita Chulani (Research Assistant) - Voice: (213) , Fax: (213)
Barry Boehm (Project Leader) - Voice: (213) , Fax: (213)
Internet Electronic Mail: cocomo-info@sunset.usc.edu
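For readers unfamiliar with the accuracy figures quoted above, "estimates within 30% of the actuals X% of the time" is the PRED(.30) measure. The sketch below computes it for a handful of made-up projects; it is illustrative only and not part of the questionnaire.

# Illustrative PRED(0.30) computation: the fraction of projects whose effort estimate
# falls within 30% of the actual effort. All project values below are made up.

def pred(actuals, estimates, level=0.30):
    hits = sum(1 for a, e in zip(actuals, estimates) if abs(e - a) / a <= level)
    return hits / len(actuals)

actual_pm   = [100.0, 250.0, 40.0, 600.0]    # person-months (hypothetical)
estimate_pm = [120.0, 150.0, 42.0, 650.0]
print(pred(actual_pm, estimate_pm))           # -> 0.75, i.e. within 30% of actuals 75% of the time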

182 COCOMO II Data Submission Address: COCOMO II Data Submission Center for Software Engineering Department of Computer Science Henry Salvatori Room 330 University of Southern California 941 W. 37th Place Los Angeles, CA U.S.A. 2 Project Level Information As described in the Introduction section of this questionnaire, project level information is applicable for the whole project. This includes things like application type and development activity being reported. 2.A General Information 2.1 Affiliate Identification Number. Each separate software project contributing data will have a separate file identification number of the form XXX. XXX will be one of a random set of three-digit organization identification numbers, provided by USC Center for Software Engineering to the Affiliate. 2.2 Project Identification Number. The project identification is a three digit number assigned by the organization. Only the Affiliate knows the correspondence between YYY 170

and the actual project.

2.3 Date Prepared. This is the date the data elements were collected for submission.

2.4 Application Type. This field captures a broad description of the type of activity this software application is attempting to perform.

Circle One: Command and Control, MIS, Simulation, Communication, Operating Systems, Software Dev. Tools, Diagnostics, Process Control, Testing, Engineering and Science, Signal Processing, Utilities, Other:

2.5 Development Type. Is the development a new software product or an upgrade of an existing product?

Circle One: New Product, Upgrade

2.6 Development Process. This is a description of the software process used to control the software development, e.g. waterfall, spiral, etc.

2.7 Step in Process. This field captures information about the project's position in its development process. The answers depend on the process model being followed.

Activity. This field captures the waterfall phase of development that the project is in. For one-time reporting the activity is `completed'. It is assumed that data for completed projects includes data from software requirements through integration/test. Please report the correct phasing if this is not the case.

Circle One: Requirements, Design, Code, Unit Test, Integration/Test, Maintenance, Completed, Other:

Stage (3). Stage refers to the aggregate of activities between the life cycle anchor points. The four stages, based on Rational's Objectory Process (4), and the anchor points are shown on the timeline below. Please circle the most advanced anchor point (milestone) the project has achieved.

Inception -> Life Cycle Objectives (LCO) -> Elaboration -> Life Cycle Architecture (LCA) -> Construction -> Initial Operational Capability (IOC) -> Maintenance

(3) Barry W. Boehm, "Anchoring the Software Process," IEEE Software, 13, 4, July 1996, pp.
(4) Rational Corp., Rational Objectory Process 4.1 - Your UML Process, available at

185 The COCOMO II model covers the effort required from the completion of the LCO to IOC. If you are using a waterfall model, the corresponding milestones are the Software Requirements Review, Preliminary Design Review, and Software Acceptance Test. 2.8 Development Process Iteration. If the process is iterative, e.g. spiral, which iteration is this? 2.9 COCOMO Model. This specifies which COCOMO II model is being used in this data submission. If this is a "historical" data submission, select the Post-Architecture model or the Applications Composition model. Application Composition: This model involves prototyping efforts to resolve potential high risk issues such as user interfaces, software/system interaction, performance, or technology maturity. Early Design: This model involves exploration of alternative software/system architectures and concepts of operations. At this stage of development, not enough is known to support fine-grain cost estimation. Post-Architecture: This model involves the actual development and maintenance of a software product. This stage of development proceeds most cost-effectively if a software life-cycle architecture has been developed; validated with respect to the system s mission, concept of operation, and risk; and established as the framework for the product. Circle One: Application Composition, Early Design, Post-Architecture 173

2.10 Success Rating for Project. This specifies the degree of success for the project:
  Very successful; did almost everything right
  Successful; did the big things right
  OK; stayed out of trouble
  Some Problems; took some effort to keep viable
  Major Problems; would not do this project again

Circle One: Very Successful, Successful, OK, Some Problems, Major Problems

Schedule

2.11 Year of Development. For reporting of historical data, please provide the year in which the software development was completed. For periodic reporting, put the year of this submission or leave blank.

2.12 Schedule Months. For reporting of historical data, provide the number of calendar months from the time the development began through the time it completed. For periodic reporting, provide the number of months in this development activity.

Circle the life-cycle stages that the schedule covers:

Inception -> Life Cycle Objectives (LCO) -> Elaboration -> Life Cycle Architecture (LCA) -> Construction -> Initial Operational Capability (IOC) -> Maintenance

Schedule in months:

Project Exponential Cost Drivers - Scale Factors (Wi)

  Precedentedness: Very Low - thoroughly unprecedented; Low - largely unprecedented; Nominal - somewhat unprecedented; High - generally familiar; Very High - largely familiar; Extra High - thoroughly familiar.
  Development Flexibility: Very Low - rigorous; Low - occasional relaxation; Nominal - some relaxation; High - general conformity; Very High - some conformity; Extra High - general goals.
  Architecture / risk resolution (a): Very Low - little (20%); Low - some (40%); Nominal - often (60%); High - generally (75%); Very High - mostly (90%); Extra High - full (100%).
  Team cohesion: Very Low - very difficult interactions; Low - some difficult interactions; Nominal - basically cooperative interactions; High - largely cooperative; Very High - highly cooperative; Extra High - seamless interactions.

  a. % significant module interfaces specified, % significant risks eliminated.

Enter the rating level for the first four cost drivers.

2.13 Precedentedness (PREC). If the product is similar to several that have been developed before then the precedentedness is high. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Extra High   Don't Know
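For orientation, the scale factor ratings requested in items 2.13 through 2.17 below jointly set the exponent applied to size in the COCOMO II effort equation, Effort = A * Size^E * (product of effort multipliers). The sketch below shows the usual form of that aggregation; the rating-to-weight mapping and the baseline exponent of 0.91 are assumptions for illustration, not the calibrated COCOMO II.1999 values.

# Sketch: scale factor ratings -> effort exponent E = B + 0.01 * sum of weights.
# The weights and the baseline B below are illustrative assumptions.

RATING_TO_WEIGHT = {"VL": 5.0, "L": 4.0, "N": 3.0, "H": 2.0, "VH": 1.0, "XH": 0.0}

def effort_exponent(scale_factor_ratings, baseline_b=0.91):
    return baseline_b + 0.01 * sum(RATING_TO_WEIGHT[r] for r in scale_factor_ratings.values())

ratings = {"PREC": "H", "FLEX": "N", "RESL": "H", "TEAM": "VH", "PMAT": "N"}
print(effort_exponent(ratings))   # -> 0.91 + 0.01 * (2 + 3 + 2 + 1 + 3) = 1.02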

2.14 Development Flexibility (FLEX). This cost driver captures the amount of constraints the product has to meet. The more flexible the requirements, schedules, interfaces, etc., the higher the rating. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Extra High   Don't Know

2.15 Architecture / Risk Resolution (RESL). This cost driver captures the thoroughness of definition and freedom from risk of the software architecture used for the product. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Extra High   Don't Know

2.16 Team Cohesion (TEAM). The Team Cohesion cost driver accounts for the sources of project turbulence and extra effort due to difficulties in synchronizing the project's stakeholders: users, customers, developers, maintainers, interfacers, and others. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Extra High   Don't Know

2.17 Process Maturity (PMAT). The procedure for determining PMAT is organized around the Software Engineering Institute's Capability Maturity Model (CMM). The time period for reporting process maturity is the time the project was underway. We are interested in the capabilities practiced at the project level more than the overall organization's capabilities. There are three ways of responding to this question; choose only one. "Key Process Area Evaluation" requires a response for each Key Process Area (KPA). We have provided enough information for you to self-evaluate the project's enactment of a KPA (we hope you will take the time to complete this section). "Overall Maturity Level" is a response that captures the result of an organized evaluation based on the CMM. "No Response" means you do not know or will not report the process maturity at either the Capability Maturity Model or Key Process Area level.

No Response
Overall Maturity Level: CMM Level 1 (lower half), CMM Level 1 (upper half), CMM Level 2, CMM Level 3, CMM Level 4, CMM Level 5
Basis of estimate: Software Process Assessment (SPA), Software Capability Evaluation (SCE), Interim Process Assessment (IPA), Other:

190 Key Process Area Evaluation Enough information is provided in the following table so that you can assess the degree to which a KPA was exercised on the project. Almost Always (over 90% of the time) when the goals are consistently achieved and are well established in standard operating procedures. Frequently (about 60 to 90% of the time) when the goals are achieved relatively often, but sometimes are omitted under difficult circumstances. About Half (about 40 to 60% of the time) when the goals are achieved about half of the time. Occasionally (about 10 to 40% of the time) when the goals are sometimes achieved, but less often. Rarely If Ever (less than 10% of the time) when the goals are rarely if ever achieved. Does Not Apply when you have the required knowledge about your project or organization and the KPA, but you feel the KPA does not apply to your circumstances (e.g. Subcontract Management). Don t Know when you are uncertain about how to respond for the KPA. 178

191 Key Process Area Requirements Management: involves establishing and maintaining an agreement with the customer on the requirements for the software project. Software Project Planning: establishes reasonable plans for performing the software engineering activities and for managing the software project. Software Project Tracking and Oversight: provides adequate visibility into actual progress so that management can take corrective actions when the software project s performance deviates significantly from the software plans. Goals of each KPA System requirements allocated to software are controlled to establish a baseline for software engineering and management use. Software plans, products, and activities are kept consistent with the system requirements allocated to software. Software estimates are documented for use in planning and tracking the software project. Software project activities and commitments are planned and documented. Affected groups and individuals agree to their commitments related to the software project. Actual results and performances are tracked against the software plans. Corrective actions are taken and managed to closure when actual results and performance deviate significantly from the software plans. Changes to software commitments are agreed to by the affected groups and individuals. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 179

192 Key Process Area Software Subcontract Management: involves selecting a software subcontractor, establishing commitments with the subcontractor, and tracking and reviewing the subcontractor s performance and results. Software Quality Assurance: provides management with appropriate visibility into the process being used by the software project and of the products being built. Goals of each KPA The prime contractor selects qualified software subcontractors. The prime contractor and the software subcontractor agree to their commitments to each other. The prime contractor and the software subcontractor maintain ongoing communications. The prime contractor tracks the software subcontractor s actual results and performance against its commitments. Software quality assurance activities are planned. Adherence of software products and activities to the applicable standards, procedures, and requirements is verified objectively. Affected groups and individuals are informed of software quality assurance activities and results. Noncompliance issues that cannot be resolved within the software project are addressed by senior management. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 180

193 Key Process Area Software Configuration Management: establishes and maintains the integrity of the products of the software project throughout the project s software life cycle. Organization Process Focus: establishes the organizational responsibility for software process activities that improve the organization s overall software process capability. Organization Process Definition: develops and maintains a usable set of software process assets that improve process performance across the projects and provides a basis for cumulative, long- term benefits to the organization. Goals of each KPA Software configuration management activities are planned. Selected software work products are identified, controlled, and available. Changes to identified software work products are controlled. Affected groups and individuals are informed of the status and content of software baselines. Software process development and improvement activities are coordinated across the organization. The strengths and weaknesses of the software processes used are identified relative to a process standard. Organization-level process development and improvement activities are planned. A standard software process for the organization is developed and maintained. Information related to the use of the organization s standard software process by the software projects is collected, reviewed, and made available. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 181

194 Key Process Area Training Program: develops the skills and knowledge of individuals so they can perform their roles effectively and efficiently. Integrated Software Management: integrates the software engineering and management activities into a coherent, defined software process that is tailored from the organization s standard software process and related process assets. Software Product Engineering: integrates all the software engineering activities to produce and support correct, consistent software products effectively and efficiently. Goals of each KPA Training activities are planned. Training for developing the skills and knowledge needed to perform software management and technical roles is provided. Individuals in the software engineering group and softwarerelated groups receive the training necessary to perform their roles. The project s defined software process is a tailored version of the organization s standard software process. The project is planned and managed according to the project s defined software process. The software engineering tasks are defined, integrated, and consistently performed to produce the software. Software work products are kept consistent with each other. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 182

195 Key Process Area Intergroup Coordination: establishes a means for the software engineering group to participate actively with the other engineering groups so the project is better able to satisfy the customer s needs effectively and efficiently. Peer Review: removes defects from the software work products early and efficiently. Goals of each KPA The customer s requirements are agreed to by all affected groups. The commitments between the engineering groups are agreed to by the affected groups. The engineering groups identify, track, and resolve intergroup issues. Peer review activities are planned. Defects in the software work products are identified and removed. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 183

196 Key Process Area Quantitative Process Management: controls the process performance of the software project quantitatively. Software Quality Management: involves defining quality goals for the software products, establishing plans to achieve these goals, and monitoring and adjusting the software plans, software work products, activities, and quality goals to satisfy the needs and desires of the customer and end user. Defect Prevention: analyzes defects that were encountered in the past and takes specific actions to prevent the occurrence of those types of defects in the future. Goals of each KPA The quantitative process management activities are planned. The process performance of the project s defined software process is controlled quantitatively. The process capability of the organization s standard software process is known in quantitative terms. The project s software quality management activities are planned. Measurable goals for software product quality and their priorities are defined. Actual progress toward achieving the quality goals for the software products is quantified and managed. Defect prevention activities are planned. Common causes of defects are sought out and identified. Common causes of defects are prioritized and systematically eliminated. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 184

197 Key Process Area Technology Change Management: involves identifying, selecting, and evaluating new technologies, and incorporating effective technologies into the organization. Process Change Management: involves defining process improvement goals and, with senior management sponsorship, proactively and systematically identifying, evaluating, and implementing improvements to the organization s standard software process and the projects defined software processes on a continuous basis. Goals of each KPA Incorporation of technology changes are planned. New technologies are evaluated to determine their effect on quality and productivity. Appropriate new technologies are transferred into normal practice across the organization. Continuous process improvement is planned. Participation in the organization s software process improvement activities is organization wide. The organization s standard software process and the projects defined software processes are improved continuously. Almost Always Very Often About Half Some Times Rarely If Ever Does Not Don t Know 185
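Where only the Key Process Area ratings above are available, a single equivalent process maturity level can be derived by averaging per-KPA achievement. The sketch below shows one plausible aggregation; the percentage mapping and the 0-to-5 scaling are assumptions for illustration, and the exact PMAT derivation is defined in the Model Definition Manual.

# One plausible roll-up of KPA ratings into an equivalent process maturity level (0-5).
# The achievement percentages and the scaling are assumptions, not the official mapping.

ACHIEVEMENT_PERCENT = {
    "Almost Always": 100, "Frequently": 75, "About Half": 50,
    "Occasionally": 25, "Rarely If Ever": 5,
}

def equivalent_pmat(kpa_ratings):
    # 'Does Not Apply' and 'Don't Know' responses are simply excluded from the average.
    rated = [ACHIEVEMENT_PERCENT[r] for r in kpa_ratings.values() if r in ACHIEVEMENT_PERCENT]
    return 5.0 * sum(rated) / (100.0 * len(rated))

sample = {"Requirements Management": "Almost Always",
          "Software Project Planning": "Frequently",
          "Peer Review": "About Half"}
print(equivalent_pmat(sample))   # -> 5 * (100 + 75 + 50) / 300 = 3.75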

2.B COQUALMO (CONSTRUCTIVE QUALITY MODEL)

This subsection has additional questions related to the COQUALMO model.

2.18 Severity of Defects. Categorize the defects based on their severity using the following classification information (5). Note that only Critical, High and Medium severity defects are accounted for in COQUALMO.

Critical - Causes a system crash or unrecoverable data loss, or jeopardizes personnel. The product is unusable (and in mission/safety software would prevent the completion of the mission).

High - Causes impairment of critical system functions, and no workaround solution exists. Some aspects of the product do not work (and the defect adversely affects successful completion of the mission in mission/safety software), but some attributes do work in the current situation.

Medium - Causes impairment of critical system function, though a workaround solution does exist. The product can be used, but a workaround (from the customer's preferred method of operation) must be used to achieve some capabilities. The presence of medium priority defects usually degrades the work.

Low - Causes a low level of inconvenience or annoyance. The product meets its requirements and can be used with just a little inconvenience. Typos in displays, such as spelling, punctuation, and grammar, which generally do not cause operational problems are usually categorized as low severity.

None - Concerns a duplicate or completely trivial problem, such as a minor typo in supporting documentation.

(5) Adapted from IEEE-Std

2.19 Defect Introduction by Artifact, Stage and Severity. The software development process can be viewed as introducing a certain number of defects into each software product artifact. Stage refers to the aggregate of activities between the life cycle anchor points. The four stages, based on Rational's UML Process, and the anchor points are shown on the timeline below. Enter the number of defects introduced in each of the artifacts involved in the software development process. A Requirements Defect is a defect introduced in the Requirements activity, a Design Defect is a defect introduced in the Design activity, and so on.

Stage: Inception -> (LCO) -> Elaboration -> (LCA) -> Construction -> (IOC)
Artifact: Requirements, Design, Code

Requirements Defects - Severity: Urgent, High, Medium, Low, None. No. of Requirements Defects:

Design Defects - Severity: Urgent, High, Medium, Low, None. No. of Design Defects:

Code Defects - Severity: Urgent, High, Medium, Low, None. No. of Code Defects:

2.20 Defect Removal by Artifact, Stage and Capability. Throughout the development life cycle, defect removal techniques are incorporated to eliminate defects before the product is delivered. Enter the number of defects removed in each stage of the software development process.

Stage: Inception -> (LCO) -> Elaboration -> (LCA) -> Construction -> (IOC)
Artifact: Requirements, Design, Code

COQUALMO models defect removal by classifying defect removal capabilities into three relatively orthogonal profiles, with each profile having six levels of increasing defect removal capability, namely Very Low, Low, Nominal, High, Very High and Extra High, with Extra High being the most effective in defect removal.

Automated Analysis Rating Scale (circle your rating):
  Very Low - Simple compiler syntax checking.
  Low - Basic compiler capabilities for static module-level code analysis, syntax, type-checking.
  Nominal - Some compiler extensions for static module and inter-module level code analysis, syntax, type-checking. Basic requirements and design consistency, traceability checking.
  High - Intermediate-level module and inter-module code syntax and semantic analysis. Simple requirements/design view consistency checking.
  Very High - More elaborate requirements/design view consistency checking. Basic distributed-processing and temporal analysis, model checking, symbolic execution.
  Extra High - Formalized* specification and verification. Advanced distributed processing and temporal analysis, model checking, symbolic execution.
  Don't know

  * Consistency-checkable preconditions and postconditions, but not mathematical theorems.

Peer Reviews Rating Scale (circle your rating):
  Very Low - No peer review.
  Low - Ad-hoc informal walkthroughs. Minimal preparation, no follow-up.
  Nominal - Well-defined sequence of preparation, review, minimal follow-up. Informal review roles and procedures.
  High - Formal review roles with all participants well-trained and procedures applied to all products using basic checklists, follow-up.
  Very High - Formal review roles with all participants well-trained and procedures applied to all product artifacts & changes (formal change control boards). Basic review checklists, root cause analysis. Formal follow-up. Use of historical data on inspection rate, preparation rate, fault density.
  Extra High - Formal review roles and procedures for fixes, change control. Extensive review checklists, root cause analysis. Continuous review process improvement. User/Customer involvement, Statistical Process Control.
  Don't know

Execution Testing and Tools Rating Scale (circle your rating):
  Very Low - No testing.
  Low - Ad-hoc testing and debugging. Basic text-based debugger.
  Nominal - Basic unit test, integration test, system test process. Basic test data management, problem tracking support. Test criteria based on checklists.
  High - Well-defined test sequence tailored to organization (acceptance / alpha / beta / flight / etc.) test. Basic test coverage tools, test support system. Basic test process management.
  Very High - More advanced test tools, test data preparation, basic test oracle support, distributed monitoring and analysis, assertion checking. Metrics-based test process management.
  Extra High - Highly advanced tools for test oracles, distributed monitoring and analysis, assertion checking. Integration of automated analysis and test tools. Model-based test process management.
  Don't know

2.C Distribution of Effort and Schedule By Stage

This subsection has additional metrics that are required to calibrate the distribution of effort and schedule by stage. Please fill this out if the necessary information is available.

204 2.21 Total Effort (Person Months). Divide the total effort required for the project into effort (in Person Months) required for each of the following three stages: Inception, Elaboration and Construction. Effort Distribution LCO LCA IOC Inception Elaboration Construction 2.22 Schedule Months. Divide the total time for development (schedule) required for the project into schedule (in Calendar Months) required for each of the following three stages: Inception, Elaboration and Construction. Schedule Distribution LCO LCA IOC Inception Elaboration Construction 3 Component Level Information Component ID If the whole project is being reported as a single component then skip to the next section. If the data being submitted is for multiple components that comprise a single project then it is necessary to identify each component with its project. Please fill out this section for each component and attach all of the component sections to the project sections describing the overall project data. 3.1 Affiliate Identification Number. Each separate software project contributing data will have a separate file identification number of the form XXX. XXX will be one of a 192

205 random set of three-digit organization identification numbers, provided by USC Center for Software Engineering to the Affiliate. 3.2 Project Identification Number. The project identification is a three digit number assigned by the organization. Only the Affiliate knows the correspondence between YYY and the actual project. The same project identification must be used with each data submission. 3.3 Component Identification (if applicable). This is a unique sequential letter that identifies a software module that is part of a project. Circle One: A B C D E F G H I J K L M N O P Q R Cost 3.4 Total Effort (Person Months). Circle the life-cycle stages that the effort estimate covers: Life Cycle Objectives Life Cycle Architecture Inception Elaboration Construction Initial Operational Capability Maintenance Total Effort 193

3.5 Hours / Person Month. Indicate the average number of hours per person month experienced by your organization.

3.6 Labor Breakout. Indicate the percentage of labor for different categories, e.g. Managers, S/W Requirement Analysts, Designers, CM/QA Personnel, Programmers, Testers, and Interfacers, for each stage of software development:

Categories (Rqts., Design, Code, Test, Management, CM/QA/Documentation) by Stage (Inception, Elaboration, Construction)

Size

The project would like to collect size in object points, logical lines of code, and unadjusted function points. Please submit all size measures that are available; e.g. if you have a component in lines of code and unadjusted function points, then submit both numbers.

3.7 Percentage of Code Breakage. This is an estimate of how much the requirements have changed over the lifetime of the project. It is the percentage of code thrown away due to requirements volatility. For example, a project which delivers 100,000 instructions but discards the equivalent of an additional 20,000 instructions would have a breakage value of 20.

See the Model Definition Manual for more detail.

3.8 Object Points. If the COCOMO II Applications Programming model was used, then enter the object point count.

3.9 New Unique SLOC. This is the number of new source lines of code (SLOC) generated.

3.10 SLOC Count Type. When reporting size in source lines of code, please indicate if the count was for logical SLOC or physical SLOC. If both are available, please submit both types of counts. If neither type of count applies to the way the code was counted, please describe the method. An extensive definition for logical source lines of code is given in an Appendix in the Model Definition Manual.

Circle One: Logical SLOC, Physical SLOC (carriage returns), Physical SLOC (semicolons), Non-Commented/Non-Blank SLOC, Other:
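As a small aside on item 3.7 above, the breakage percentage is just the discarded code expressed as a percentage of the delivered code; the sketch below restates the 100,000 / 20,000 example.

# Breakage (item 3.7): percentage of code thrown away due to requirements volatility.

def breakage_percent(delivered_sloc, discarded_sloc):
    return 100.0 * discarded_sloc / delivered_sloc

print(breakage_percent(100_000, 20_000))   # -> 20.0, matching the example in item 3.7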

208 3.11 Unadjusted Function Points. If you are using the Early Design or Post-Architecture model, provide the total Unadjusted Function Points for each type. An Unadjusted Function Point is the product of the function point count and the weight for that type of point. Function Points are discussed in the Model Definition Manual Programming Language. If you are using the Early Design or Post-Architecture model, enter the language name that was used in this component, e.g. Ada, C, C++, COBOL, FORTRAN and the amount of usage if more than one language was used. Language Used Percentage Used 3.13 Software Maintenance Parameters. For software maintenance, use items to describe the size of the base software product, and use the same units to describe the following parameters: a. Amount of software added: b. Amount of software modified: c. Amount of software deleted: 196

3.14 Object Points Reused. If you are using the Application Composition model, enter the number of object points reused. Do not fill in the fields on DM, CM, IM, SU, or AA.

3.15 ASLOC Adapted. If you are using the Early Design or Post-Architecture model, enter the amounts for the SLOC adapted.

3.16 ASLOC Count Type. When reporting size in source lines of code, please indicate if the count was for logical ASLOC or physical ASLOC. If both are available, please submit both types of counts. If neither type of count applies to the way the code was counted, please describe the method. An extensive definition for logical source lines of code is given in an Appendix in the Model Definition Manual.

Circle One: Logical ASLOC, Physical ASLOC (carriage returns), Physical ASLOC (semicolons), Non-Commented/Non-Blank ASLOC, Other:

3.17 Design Modified - DM. The percentage of design modified.

3.18 Code Modified - CM. The percentage of code modified.

3.19 Integration and Test - IM. The percentage of the adapted software's original integration & test effort expended.

3.20 Software Understanding - SU.

Table 1: Rating Scale for Software Understanding Increment SU
  Structure: Very Low - Very low cohesion, high coupling, spaghetti code. Low - Moderately low cohesion, high coupling. Nominal - Reasonably well-structured; some weak areas. High - High cohesion, low coupling. Very High - Strong modularity, information hiding in data / control structures.
  Application Clarity: Very Low - No match between program and application world views. Low - Some correlation between program and application. Nominal - Moderate correlation between program and application. High - Good correlation between program and application. Very High - Clear match between program and application world-views.
  Self-Descriptiveness: Very Low - Obscure code; documentation missing, obscure or obsolete. Low - Some code commentary and headers; some useful documentation. Nominal - Moderate level of code commentary, headers, documentation. High - Good code commentary and headers; useful documentation; some weak areas. Very High - Self-descriptive code; documentation up-to-date, well-organized, with design rationale.
  SU Increment to ESLOC: Very Low - 50%, Low - 40%, Nominal - 30%, High - 20%, Very High - 10%.

The Software Understanding increment (SU) is obtained from Table 1. SU is expressed quantitatively as a percentage. If the software is rated very high on structure, application clarity, and self-descriptiveness, the software understanding and interface checking penalty is 10%. If the software is rated very low on these factors, the penalty is 50%.

SU is determined by taking the subjective average of the three categories. Enter the percentage of SU:

3.21 Assessment and Assimilation - AA.

Table 2: Rating Scale for Assessment and Assimilation Increment (AA)
  AA Increment - Level of AA Effort
  0 - None
  2 - Basic module search and documentation
  4 - Some module Test and Evaluation (T&E), documentation
  6 - Considerable module T&E, documentation
  8 - Extensive module T&E, documentation

The other nonlinear reuse increment deals with the degree of Assessment and Assimilation (AA) needed to determine whether a fully-reused software module is appropriate to the application, and to integrate its description into the overall product description. Table 2 provides the rating scale and values for the assessment and assimilation increment. Enter the percentage of AA:

3.22 Programmer Unfamiliarity - UNFM.

Table 3: Rating Scale for Programmer Unfamiliarity (UNFM)
  UNFM Increment - Level of Unfamiliarity
  0.0 - Completely familiar
  0.2 - Mostly familiar
  0.4 - Somewhat familiar
  0.6 - Considerably familiar
  0.8 - Mostly unfamiliar
  1.0 - Completely unfamiliar

The amount of effort required to modify existing software is a function not only of the amount of modification (AAF) and the understandability of the existing software (SU), but also of the programmer's relative unfamiliarity with the software (UNFM). The UNFM parameter is applied multiplicatively to the software understanding effort increment. If the programmer works with the software every day, the 0.0 multiplier for UNFM will add no software understanding increment. If the programmer has never seen the software before, the 1.0 multiplier will add the full software understanding effort increment. The rating of UNFM is given in Table 3. Enter the Level of Unfamiliarity:

Post-Architecture Effort Multipliers. These are the 17 effort multipliers used in the COCOMO II Post-Architecture model to adjust the nominal effort, in Person Months, to reflect the software product under development. They are grouped into four categories: product, platform, personnel, and project.
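To show where these inputs fit together, the sketch below combines the reuse parameters from items 3.13 through 3.22 into an equivalent size and applies the overall effort form that the multipliers rated in items 3.23 through 3.39 feed into. Treat the AAF weights, the 50% breakpoint, the constant A = 2.94, and the example multiplier values as approximations and placeholders rather than the definitive COCOMO II.1999 calibration.

# Sketch (approximate, placeholder constants): equivalent size for adapted code and
# the Post-Architecture effort form PM = A * (KSLOC ** E) * product(effort multipliers).

def equivalent_sloc(asloc, dm, cm, im, su, aa, unfm):
    # DM, CM, IM, SU, AA are percentages; UNFM is 0.0-1.0 (items 3.17-3.22).
    aaf = 0.4 * dm + 0.3 * cm + 0.3 * im
    if aaf <= 50:
        aam = (aa + aaf * (1 + 0.02 * su * unfm)) / 100.0
    else:
        aam = (aa + aaf + su * unfm) / 100.0
    return asloc * aam

def effort_person_months(new_sloc, equiv_adapted_sloc, exponent_e, effort_multipliers, a=2.94):
    ksloc = (new_sloc + equiv_adapted_sloc) / 1000.0
    pm = a * (ksloc ** exponent_e)
    for em in effort_multipliers.values():
        pm *= em
    return pm

# Hypothetical component: 30 KSLOC new, 10 KSLOC adapted with light modification.
equiv = equivalent_sloc(asloc=10_000, dm=10, cm=20, im=30, su=30, aa=4, unfm=0.4)
ems = {"RELY": 1.10, "CPLX": 1.17, "ACAP": 0.85, "TOOL": 0.90}   # others at 1.0 (Nominal)
print(effort_person_months(30_000, equiv, exponent_e=1.02, effort_multipliers=ems))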

Product Cost Drivers. For maintenance projects, identify any differences between the base code and the modified code product cost drivers (e.g. complexity).

  RELY: Very Low - slight inconvenience; Low - low, easily recoverable losses; Nominal - moderate, easily recoverable losses; High - high financial loss; Very High - risk to human life.
  DATA (DB bytes / Pgm SLOC): Low - D/P < 10; Nominal - 10 <= D/P < 100; High - 100 <= D/P < 1000; Very High - D/P >= 1000.
  RUSE: Low - none; Nominal - across project; High - across program; Very High - across product line; Extra High - across multiple product lines.
  DOCU: Very Low - Many life-cycle needs uncovered; Low - Some life-cycle needs uncovered; Nominal - Right-sized to life-cycle needs; High - Excessive for life-cycle needs; Very High - Very excessive for life-cycle needs.

3.23 Required Software Reliability (RELY). This is the measure of the extent to which the software must perform its intended function over a period of time. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Don't Know

3.24 Data Base Size (DATA). This measure attempts to capture the effect large data requirements have on product development, e.g. testing.

The rating is determined by calculating D/P, where D is the number of bytes of data and P is the number of SLOC. See the Model Definition Manual for more details.

Low   Nominal   High   Very High   Don't Know

3.25 Develop for Reuse (RUSE). This cost driver accounts for the additional effort needed to construct components intended for reuse on the current or future projects. See the Model Definition Manual for more details.

Low   Nominal   High   Very High   Don't Know

3.26 Documentation match to life-cycle needs (DOCU). This captures the suitability of the project's documentation to its life-cycle needs. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Don't Know

215 3.27 Product Complexity (CPLX): Control Operations Computational Operations Devicedependent Operations Data Management Operations User Interface Management Operations Very Low Straight-line code with a few nonnested structured programming operators: DOs, CASEs, IFTHENELSEs. Simple module composition via procedure calls or simple scripts. Evaluation of simple expressions: e.g., A=B+C*(D-E) Simple read, write statements with simple formats. Simple arrays in main memory. Simple COTS- DB queries, updates. Simple input forms, report generators. Low Straightforward nesting of structured programming operators. Mostly simple predicates Evaluation of moderate-level expressions: e.g., D=SQRT(B**2-4.*A*C) No cognizance needed of particular processor or I/O device characteristics. I/O done at GET/PUT level. Single file subsetting with no data structure changes, no edits, no intermediate files. Moderately complex COTS-DB queries, updates. Use of simple graphic user interface (GUI) builders. Nominal Mostly simple nesting. Some intermodule control. Decision tables. Simple callbacks or message passing, including middlewaresupported distributed processing Use of standard math and statistical routines. Basic matrix/vector operations. I/O processing includes device selection, status checking and error processing. Multi-file input and single file output. Simple structural changes, simple edits. Complex COTS-DB queries, updates. Simple use of widget set. High Highly nested structured programming operators with many compound predicates. Queue Basic numerical analysis: multivariate interpolation, ordinary differential equations. Basic truncation, roundoff concerns. Operations at physical I/O level (physical storage address translations; seeks, reads, Simple triggers activated by data stream contents. Complex data Widget set development and extension. Simple voice I/O, 203

216 Control Operations Computational Operations Devicedependent Operations Data Management Operations User Interface Management Operations and stack control. Homogeneous, distributed processing. Single processor soft realtime control. etc.). Optimized I/O overlap. restructuring. multimedia. Very High Reentrant and recursive coding. Fixed-priority interrupt handling. Task synchronization, complex callbacks, heterogeneous distributed processing. Singleprocessor hard real-time control. Difficult but structured numerical analysis: near-singular matrix equations, partial differential equations. Simple parallelization. Routines for interrupt diagnosis, servicing, masking. Communication line handling. Performanceintensive embedded systems. Distributed database coordination. Complex triggers. Search optimization. Moderately complex 2D/3D, dynamic graphics, multimedia. Extra High Multiple resource scheduling with dynamically changing priorities. Microcode-level control. Distributed hard real-time control. Difficult and unstructured numerical analysis: highly accurate analysis of noisy, stochastic data. Complex parallelization. Device timingdependent coding, microprogrammed operations. Performancecritical embedded systems. Highly coupled, dynamic relational and object structures. Natural language data management. Complex multimedia, virtual reality. Complexity is divided into five areas: control operations, computational operations, device-dependent operations, data management operations, and user interface management operations. Select the area or combination of areas that characterize the product or a sub-system of the product. The complexity rating is the subjective weighted average of these areas. Very Low Low Nominal High Very High Extra High Don t Know 204

Platform Cost Drivers

The platform refers to the target-machine complex of hardware and infrastructure software.

  TIME (use of available execution time): Nominal - <= 50%; High - 70%; Very High - 85%; Extra High - 95%.
  STOR (use of available storage): Nominal - <= 50%; High - 70%; Very High - 85%; Extra High - 95%.
  PVOL: Low - major change every 12 mo., minor change every 1 mo.; Nominal - major: 6 mo., minor: 2 wk.; High - major: 2 mo., minor: 1 wk.; Very High - major: 2 wk., minor: 2 days.

3.28 Execution Time Constraint (TIME). This is a measure of the execution time constraint imposed upon a software system. See the Model Definition Manual for more details.

Nominal   High   Very High   Extra High   Don't Know

3.29 Main Storage Constraint (STOR). This rating represents the degree of main storage constraint imposed on a software system or subsystem. See the Model Definition Manual for more details.

Nominal   High   Very High   Extra High   Don't Know

218 3.30 Platform Volatility (PVOL). "Platform" is used here to mean the complex of hardware and software (OS, DBMS, etc.) the software product calls on to perform its tasks. See the Model Definition Manual for more details. Low Nominal High Very High Don t Know Personnel Cost Drivers. Very Low Low Nominal High Very High ACAP 15th percentile 35th percentile 55th percentile 75th percentile 90th percentile PCAP 15th percentile 35th percentile 55th percentile 75th percentile 90th percentile PCON 48% / year 24% / year 12% / year 6% / year 3% / year AEXP 2 months 6 months 1 year 3 years 6 years PEXP 2 months 6 months 1 year 3 years 6 years LTEX 2 months 6 months 1 year 3 years 6 years 3.31 Analyst Capability (ACAP). Analysts are personnel that work on requirements, high level design and detailed design. See the Model Definition Manual for more details. Very Low Low Nominal High Very High Don t Know 3.32 Programmer Capability (PCAP). Evaluation should be based on the capability of the programmers as a team rather than as individuals. Major factors which should be considered in the rating are ability, efficiency and thoroughness, and the ability to communicate and cooperate. See the Model Definition Manual for more details. Very Low Low Nominal High Very High Don t Know 206

219 3.33 Personnel Continuity (PCON). The rating scale for PCON is in terms of the project s annual personnel turnover. See the Model Definition Manual for more details. Very Low Low Nominal High Very High Don t Know 3.34 Applications Experience (AEXP). This rating is dependent on the level of applications experience of the project team developing the software system or subsystem. The ratings are defined in terms of the project team s equivalent level of experience with this type of application. See the Model Definition Manual for more details. Very Low Low Nominal High Very High Don t Know 3.35 Platform Experience (PEXP). The Post-Architecture model broadens the productivity influence of PEXP, recognizing the importance of understanding the use of more powerful platforms, including more graphic user interface, database, networking, and distributed middleware capabilities. See the Model Definition Manual for more details. Very Low Low Nominal High Very High Don t Know 3.36 Language and Tool Experience (LTEX). This is a measure of the level of programming language and software tool experience of the project team developing the software system or subsystem. See the Model Definition Manual for more details. Very Low Low Nominal High Very High Don t Know 207

Project Cost Drivers. This table gives a summary of the criteria used to select a rating level for project cost drivers.

  TOOL: Very Low - edit, code, debug; Low - simple, frontend, backend CASE, little integration; Nominal - basic lifecycle tools, moderately integrated; High - strong, mature lifecycle tools, moderately integrated; Very High - strong, mature, proactive lifecycle tools, well integrated with processes, methods, reuse.
  SITE (Collocation): Very Low - International; Low - Multi-city and Multicompany; Nominal - Multi-city or Multicompany; High - Same city or metro area; Very High - Same building or complex; Extra High - Fully collocated.
  SITE (Communications): Very Low - Some phone, mail; Low - Individual phone, FAX; Nominal - Narrowband email; High - Wideband electronic communications; Very High - Wideband electronic communications, occasional video conferencing; Extra High - Interactive multimedia.
  SCED: Very Low - 75% of nominal; Low - 85% of nominal; Nominal - 100% of nominal; High - 130% of nominal; Very High - 160% of nominal.

3.37 Use of Software Tools (TOOL). See the Model Definition Manual.

Very Low   Low   Nominal   High   Very High   Don't Know

3.38 Multisite Development (SITE). Given the increasing frequency of multisite developments, and indications that multisite development effects are significant, the SITE cost driver has been added in COCOMO II. Determining its cost driver rating involves the assessment and averaging of two factors:

site collocation (from fully collocated to international distribution) and communication support (from surface mail and some phone access to full interactive multimedia). See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Extra High   Don't Know

3.39 Required Development Schedule (SCED). This rating measures the schedule constraint imposed on the project team developing the software. The ratings are defined in terms of the percentage of schedule stretch-out or acceleration with respect to a nominal schedule for a project requiring a given amount of effort. See the Model Definition Manual for more details.

Very Low   Low   Nominal   High   Very High   Don't Know
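The SCED rating interacts with the nominal schedule. One common form of the COCOMO II schedule estimate, sketched below, derives calendar months from the estimated effort and then applies the SCED compression or stretch-out percentage; the constants and the exponent adjustment are placeholders, not the calibrated COCOMO II.1999 values.

# Sketch of a COCOMO II-style schedule estimate (placeholder constants):
#   TDEV = C * PM_nominal_schedule ** (D + 0.2 * (E - B)) * (SCED% / 100)

def schedule_months(pm_nominal, exponent_e, sced_percent=100.0, c=3.67, d=0.28, b=0.91):
    return c * (pm_nominal ** (d + 0.2 * (exponent_e - b))) * (sced_percent / 100.0)

# Hypothetical project: 100 person-months at E = 1.02, with an 85%-of-nominal (compressed) schedule.
print(schedule_months(pm_nominal=100.0, exponent_e=1.02, sced_percent=85.0))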

Appendix C: Summary of COCOMO II Data

Appendix C provides a summary of the data used for the Bayesian calibration of COCOMO II.1999. It illustrates the distribution of the 161 datapoints that compose the COCOMO II.1999 dataset in terms of histograms for each of the five scale factors and 17 effort multipliers of the COCOMO II Post-Architecture model.

[Histograms of the distribution of the 161 datapoints across rating levels for each scale factor and effort multiplier, including Data Base Size (DATA), appear here.]


More information

ANALYSIS OF FACTORS CONTRIBUTING TO EFFICIENCY OF SOFTWARE DEVELOPMENT

ANALYSIS OF FACTORS CONTRIBUTING TO EFFICIENCY OF SOFTWARE DEVELOPMENT ANALYSIS OF FACTORS CONTRIBUTING TO EFFICIENCY OF SOFTWARE DEVELOPMENT Nafisseh Heiat, College of Business, Montana State University-Billings, 1500 University Drive, Billings, MT 59101, 406-657-2224, nheiat@msubillings.edu

More information

Current and Future Challenges for Ground System Cost Estimation

Current and Future Challenges for Ground System Cost Estimation Current and Future Challenges for Ground System Cost Estimation Barry Boehm, Jim Alstad, USC-CSSE GSAW 2014 Working Group Session 11F Cost Estimation for Next-Generation Ground Systems February 26, 2014

More information

Headquarters U.S. Air Force

Headquarters U.S. Air Force Headquarters U.S. Air Force Software Sizing Lines of Code and Beyond Air Force Cost Analysis Agency Corinne Wallshein June 2009 1 Presentation Overview About software sizing Meaning Sources Importance

More information

Building a Local Resource Model

Building a Local Resource Model Building a Local Resource Model MODELING AND MEASURING RESOURCES Model Validation Study Walston and Felix build a model of resource estimation for the set of projects at the IBM Federal Systems Division.

More information

Introduction to Cost Estimation - Part I

Introduction to Cost Estimation - Part I Introduction to Cost Estimation - Part I Best Practice Checklists Best Practice 1: Estimate Purpose and Scope The estimate s purpose is clearly defined The estimate s scope is clearly defined The level

More information

Software Estimation 25 Years and Still Guessing

Software Estimation 25 Years and Still Guessing Software Composition Technologies Helping People Gain Control of Software Development Software Estimation 25 Years and Still Guessing Raymond Boehm Voice: 732.906.3671 Fax: 732.906.5728 www.softcomptech.com

More information

COCOMO I1 Status and Plans

COCOMO I1 Status and Plans - A University of Southern California c I S IE I Center for Software Engineering COCOMO I1 Status and Plans Brad Clark, Barry Boehm USC-CSE Annual Research Review March 10, 1997 University of Southern

More information

Software Efforts & Cost Estimation Matrices and Models. By: Sharaf Hussain

Software Efforts & Cost Estimation Matrices and Models. By: Sharaf Hussain Software Efforts & Cost Estimation Matrices and Models By: Sharaf Hussain Techniques for estimating Software Cost Lines of Code Function Point COCOMO SLIM Lines of code (LOC) Lines of Code LOC NCLOC (Non

More information

A Study on Software Metrics and Phase based Defect Removal Pattern Technique for Project Management

A Study on Software Metrics and Phase based Defect Removal Pattern Technique for Project Management International Journal of Soft Computing and Engineering (IJSCE) A Study on Software Metrics and Phase based Defect Removal Pattern Technique for Project Management Jayanthi.R, M Lilly Florence Abstract:

More information

A Comparative evaluation of Software Effort Estimation using REPTree and K* in Handling with Missing Values

A Comparative evaluation of Software Effort Estimation using REPTree and K* in Handling with Missing Values Australian Journal of Basic and Applied Sciences, 6(7): 312-317, 2012 ISSN 1991-8178 A Comparative evaluation of Software Effort Estimation using REPTree and K* in Handling with Missing Values 1 K. Suresh

More information

Modeling Software Defect Introduction and Removal: COQUALMO (COnstructive QUALity MOdel)

Modeling Software Defect Introduction and Removal: COQUALMO (COnstructive QUALity MOdel) Modeling Software Defect Introduction and Removal: COQUALMO (COnstructive QUALity MOdel) Sunita Chulani and Barry Boehm USC - Center for Software Engineering Los Angeles, CA 90089-0781 1-213-740-6470 {sdevnani,

More information

MODELING RESOURCES 7.1

MODELING RESOURCES 7.1 MODELING RESOURCES There are a variety of reasons for modeling resources. We may wish to do an initial prediction of resources, i.e., based upon a set of factors that can be estimated about a project,

More information

Estimating Duration and Cost. CS 390 Lecture 26 Chapter 9: Planning and Estimating. Planning and the Software Process

Estimating Duration and Cost. CS 390 Lecture 26 Chapter 9: Planning and Estimating. Planning and the Software Process CS 390 Lecture 26 Chapter 9: Planning and Estimating Before starting to build software, it is essential to plan the entire development effort in detail Planning continues during development and then postdelivery

More information

SOFTWARE ENGINEERING

SOFTWARE ENGINEERING SOFTWARE ENGINEERING Project planning Once a project is found to be feasible, software project managers undertake project planning. Project planning is undertaken and completed even before any development

More information

Software Cost Risk Estimation and Management at the Jet Propulsion Laboratory

Software Cost Risk Estimation and Management at the Jet Propulsion Laboratory Software Cost Risk Estimation and Management at the Jet Propulsion Laboratory Jairus Hihn Karen Lum 17 th International Forum on COCOMO and Software Cost Modeling October 22-25, 25, 2002 Background & Context

More information

Software Cost Metrics Manual

Software Cost Metrics Manual MOTIVATION Software Cost Metrics Manual Mr. Wilson Rosa Dr. Barry Boehm Mr. Don Reifer Dr. Brad Clark Dr. Ray Madachy 21 st Systems & Software Technology Conference April 22, 2009 DOD desires more credible

More information

COCOMO II Status and Extensions. Barry Boehm, USC COCOMO / SCM Forum #13 October 7,1998. Outline

COCOMO II Status and Extensions. Barry Boehm, USC COCOMO / SCM Forum #13 October 7,1998. Outline COCOMO II Status and Extensions Barry Boehm, USC COCOMO / SCM Forum #13 October 7,1998 1 Mt98 WSCCSE 1 Outline COCOMO 11.1 998 Status and Plans Overview of Extensions COTS Integration (COCOTS) Quality:

More information

RESULTS OF DELPHI FOR THE DEFECT INTRODUCTION MODEL

RESULTS OF DELPHI FOR THE DEFECT INTRODUCTION MODEL RESULTS OF DELPHI FOR THE DEFECT INTRODUCTION MODEL (SUB-MODEL OF THE COST/QUALITY MODEL EXTENSION TO COCOMO II) Sunita Devnani-Chulani USC-CSE Abstract In software estimation, it is important to recognize

More information

SEER-SEM to COCOMO II Factor Convertor

SEER-SEM to COCOMO II Factor Convertor SEER-SEM to COCOMO II Factor Convertor Anthony L Peterson Mechanical Engineering 8 June 2011 SEER-SEM to COCOMO II Factor Convertor The Software Parametric Models COCOMO II public domain model which continues

More information

Research Paper on Software Cost Estimation Using Fuzzy Logic

Research Paper on Software Cost Estimation Using Fuzzy Logic Research Paper on Software Cost Estimation Using Fuzzy Logic Nishi M. Tech Scholar B.P.S.M.V University, Sonepat nishisinghal54@gmail.com Mr. Vikas Malik Assistant Professor B.P.S.M.V University, Sonepat

More information

MORS Introduction to Cost Estimation (Part I)

MORS Introduction to Cost Estimation (Part I) MORS Introduction to Cost Estimation (Part I) Mr. Huu M. Hoang Slide 1 Disclosure Form Slide 2 Learning Objectives of Module Two 1. Understand how to define a program using the various documents and other

More information

Analysis Of the Techniques for Software Cost Estimation

Analysis Of the Techniques for Software Cost Estimation Analysis Of the Techniques for Software Cost Estimation Poonam Pandey Assistant Professor,GLA University,Mathura Email-poonam.pandey@gla.ac.in Abstract: One of the most valuable asset in any software industry

More information

Estimating Size and Effort

Estimating Size and Effort Estimating Size and Effort Dr. James A. Bednar jbednar@inf.ed.ac.uk http://homepages.inf.ed.ac.uk/jbednar Dr. David Robertson dr@inf.ed.ac.uk http://www.inf.ed.ac.uk/ssp/members/dave.htm SAPM Spring 2006:

More information

Software Technology Conference

Software Technology Conference 30 April 2003 Costing COTS Integration Software Technology Conference Salt Lake City Linda Brooks 1 Objective Provide a roadmap for doing an estimate for a Commercial Off-the-Shelf (COTS) software intensive

More information

Software Project Planning The overall goal of project planning is to establish a pragmatic strategy for controlling, tracking, and monitoring a comple

Software Project Planning The overall goal of project planning is to establish a pragmatic strategy for controlling, tracking, and monitoring a comple Estimation for Software Projects 1 Software Project Planning The overall goal of project planning is to establish a pragmatic strategy for controlling, tracking, and monitoring a complex technical project.

More information

Fundamental estimation questions. Software cost estimation. Costing and pricing. Software cost components. Software pricing factors

Fundamental estimation questions. Software cost estimation. Costing and pricing. Software cost components. Software pricing factors Fundamental estimation questions Software cost estimation How much effort is required to complete an activity? How much calendar time is needed to complete an activity? What is the total cost of an activity?

More information

A Comparative Study on the existing methods of Software Size Estimation

A Comparative Study on the existing methods of Software Size Estimation A Comparative Study on the existing methods of Software Size Estimation Manisha Vatsa 1, Rahul Rishi 2 Department of Computer Science & Engineering, University Institute of Engineering & Technology, Maharshi

More information

Synthesis of Existing Cost Models to Meet System of Systems Needs

Synthesis of Existing Cost Models to Meet System of Systems Needs Paper #128 Synthesis of Existing Cost Models to Meet System of Systems Needs Jo Ann Lane University of Southern California Center for Software Engineering 941 W. 37th Place, SAL Room 328 Los Angeles, CA

More information

Assessing Accuracy of Formal Estimation Models and Development of an Effort Estimation Model for Industry Use

Assessing Accuracy of Formal Estimation Models and Development of an Effort Estimation Model for Industry Use Assessing Accuracy of Formal Estimation Models and Development of an Effort Estimation Model for Industry Use Paula S. Esplanada Department of Computer Science,University of the Philippines Cebu College

More information

Introduction to Systems Analysis and Design

Introduction to Systems Analysis and Design Introduction to Systems Analysis and Design What is a System? A system is a set of interrelated components that function together to achieve a common goal. The components of a system are called subsystems.

More information

CHAPTER 2 PROBLEM STATEMENT

CHAPTER 2 PROBLEM STATEMENT CHAPTER 2 PROBLEM STATEMENT Software metrics based heuristics support software quality engineering through improved scheduling and project control. It can be a key step towards steering the software testing

More information

Estimating Size and Effort

Estimating Size and Effort Estimating Size and Effort Massimo Felici and Conrad Hughes mfelici@staffmail.ed.ac.uk conrad.hughes@ed.ac.uk http://www.inf.ed.ac.uk/teaching/courses/sapm/ Slides: Dr James A. Bednar SAPM Spring 2009:

More information

Name: DBA COCOMO. Presenter(s): Janet Chu. Objective: Database version of the COCOMOll with additional functionalities.

Name: DBA COCOMO. Presenter(s): Janet Chu. Objective: Database version of the COCOMOll with additional functionalities. Demonstration Guide - USC-CSE COCOMOISCM 18 Name: DBA COCOMO Presenter(s): Janet Chu Objective: Database version of the COCOMOll 2000.3 with additional functionalities. Rationale: This software is intended

More information

Introduction to Software Metrics

Introduction to Software Metrics Introduction to Software Metrics Outline Today we begin looking at measurement of software quality using software metrics We ll look at: What are software quality metrics? Some basic measurement theory

More information

Personal Software Process SM for Engineers: Part I

Personal Software Process SM for Engineers: Part I Personal Software Process SM for Engineers: Part I Introduction to the PSP SM Defect Removal Estimation of Project Size Microsoft Project Design READING FOR THIS LECTURE A Discipline for Software Engineering,

More information

A Systematic Approach to Performance Evaluation

A Systematic Approach to Performance Evaluation A Systematic Approach to Performance evaluation is the process of determining how well an existing or future computer system meets a set of alternative performance objectives. Arbitrarily selecting performance

More information

COCOMO II Demo and ARS Example

COCOMO II Demo and ARS Example COCOMO II Demo and ARS Example CS 566 Software Management and Economics Lecture 5 (Madachy 2005; Chapter 3, Boehm et al. 2000) Ali Afzal Malik Outline USC COCOMO II tool demo Overview of Airborne Radar

More information

Models in Engineering Glossary

Models in Engineering Glossary Models in Engineering Glossary Anchoring bias is the tendency to use an initial piece of information to make subsequent judgments. Once an anchor is set, there is a bias toward interpreting other information

More information

Should Function Point Elements be Used to Build Prediction Models?

Should Function Point Elements be Used to Build Prediction Models? Should Function Point Elements be Used to Build Prediction Models? Kohei Yoshigami, Masateru Tsunoda Department of Informatics Kindai University Osaka, Japan tsunoda@info.kindai.ac.jp Yuto Yamada, Shinji

More information

EVALUATION OF PERSONNEL PARAMETERS IN SOFTWARE COST ESTIMATING MODELS THESIS. Steven L. Quick, Captain, USAF AFIT/GCA/ENV/03-07

EVALUATION OF PERSONNEL PARAMETERS IN SOFTWARE COST ESTIMATING MODELS THESIS. Steven L. Quick, Captain, USAF AFIT/GCA/ENV/03-07 EVALUATION OF PERSONNEL PARAMETERS IN SOFTWARE COST ESTIMATING MODELS THESIS Steven L. Quick, Captain, USAF AFIT/GCA/ENV/03-07 DEPARTMENT OF THE AIR FORCE AIR UNIVERSITY AIR FORCE INSTITUTE OF TECHNOLOGY

More information

Genetic Algorithm for Optimizing Neural Network based Software Cost Estimation

Genetic Algorithm for Optimizing Neural Network based Software Cost Estimation Genetic Algorithm for Optimizing Neural Network based Software Cost Estimation Tirimula Rao Benala 1, S Dehuri 2, S.C.Satapathy 1 and Ch Sudha Raghavi 1, 1 Anil Neerukonda Institute of Technology and Sciences

More information

Estimating SW Size and Effort Estimating Size and Effort

Estimating SW Size and Effort Estimating Size and Effort Estimating SW Size and Effort Estimating Size and Effort Dr. James A. Bednar jbednar@inf.ed.ac.uk http://homepages.inf.ed.ac.uk/jbednar Dr. David Robertson dr@inf.ed.ac.uk http://www.inf.ed.ac.uk/ssp/members/dave.htm

More information

Determining How Much Software Assurance Is Enough?

Determining How Much Software Assurance Is Enough? Determining How Much Software Assurance Is Enough? Tanvir Khan Concordia Institute of Information Systems Engineering Ta_k@encs.concordia.ca Abstract It has always been an interesting problem for the software

More information

Sri Vidya College of Engineering & Technology-Virudhunagar

Sri Vidya College of Engineering & Technology-Virudhunagar Sri Vidya College of Engineering &Technology Department of Information Technology Class II Year (04 Semester) Subject Code CS6403 Subject SOFTWARE ENGINEERING Prepared By R.Vidhyalakshmi Lesson Plan for

More information

CSE 435 Software Engineering. Sept 14, 2015

CSE 435 Software Engineering. Sept 14, 2015 CSE 435 Software Engineering Sept 14, 2015 What is Software Engineering Where Does the Software Engineer Fit In? Computer science: focusing on computer hardware, compilers, operating systems, and programming

More information

Cost Estimation for Secure Software & Systems Workshop Introduction

Cost Estimation for Secure Software & Systems Workshop Introduction Cost Estimation for Secure Software & Systems Workshop Introduction Edward Colbert, Sr. Research Associate Dr. Barry Boehm, Director Center for System & Software Engineering {ecolbert, boehm}@csse.usc.edu

More information

Chapter 3 Prescriptive Process Models

Chapter 3 Prescriptive Process Models Chapter 3 Prescriptive Process Models - Generic process framework (revisited) - Traditional process models - Specialized process models - The unified process Generic Process Framework Communication Involves

More information

Professor Hausi A. Müller PhD PEng FCAE Department of Computer Science Faculty of Engineering University of Victoria

Professor Hausi A. Müller PhD PEng FCAE Department of Computer Science Faculty of Engineering University of Victoria Professor Hausi A. Müller PhD PEng FCAE Department of Computer Science Faculty of Engineering University of Victoria www.engr.uvic.ca/~seng321/ courses1.csc.uvic.ca/courses/201/spring/seng/321 SENG 321

More information

Chapter 2: The Project Management and Information Technology Context

Chapter 2: The Project Management and Information Technology Context Chapter 2: The Project Management and Information Technology Context TRUE/FALSE 1. Many of the theories and concepts of project management are difficult to understand. F PTS: 1 REF: 44 2. If project managers

More information

Life Cycle Model-Based Improvement of SEER- SEM TM Schedule Estimates

Life Cycle Model-Based Improvement of SEER- SEM TM Schedule Estimates Life Cycle Model-Based Improvement of SEER- SEM TM Schedule Estimates Dr. Peter Hantos and Nancy Kern The Aerospace Corporation 2009 SEER by Galorath North American User Conference 8 October 2009 The Aerospace

More information

Simple Empirical Software Effort Estimation Models

Simple Empirical Software Effort Estimation Models University of Southern California Center for Systems and Software Engineering Simple Empirical Software Effort Estimation Models Presenter: Brad Clark Co-Authors: Wilson Rosa, Barry Boehm, Ray Madachy

More information

INDEX. As-is analysis, tool supporting, 302 Attributes, FPA, Availability, software contract requirement, 258

INDEX. As-is analysis, tool supporting, 302 Attributes, FPA, Availability, software contract requirement, 258 INDEX A Acceptance test phase, 200 Actual Effort (Person Hours), as estimation unit, 16 ADD (Added FP), 185, 188 Add elementary process, 79 Agile software projects case study, 202 204 complex issues in,

More information

Presented at the 2008 SCEA-ISPA Joint Annual Conference and Training Workshop -

Presented at the 2008 SCEA-ISPA Joint Annual Conference and Training Workshop - DEVELOPMENT AND PRODUCTION COST EQUATIONS DERIVED FROM PRICE-H TO ENABLE RAPID AIRCRAFT (MDO) TRADE STUDIES 2008 Society Cost Estimating Analysis (SCEA) Conference W. Thomas Harwick, Engineering Specialist

More information

Software Effort Estimation using Radial Basis and Generalized Regression Neural Networks

Software Effort Estimation using Radial Basis and Generalized Regression Neural Networks WWW.JOURNALOFCOMPUTING.ORG 87 Software Effort Estimation using Radial Basis and Generalized Regression Neural Networks Prasad Reddy P.V.G.D, Sudha K.R, Rama Sree P and Ramesh S.N.S.V.S.C Abstract -Software

More information

Experience with Empirical Studies in Industry: Building Parametric Models

Experience with Empirical Studies in Industry: Building Parametric Models Experience with Empirical Studies in Industry: Building Parametric Models Barry Boehm, USC boehm@usc.edu CESI 2013 May 20, 2013 5/20/13 USC-CSSE 1 Outline Types of empirical studies with Industry Types,

More information

Software Development Software Development Activities

Software Development Software Development Activities Software Development Software Development Activities Problem Definition Requirements Analysis Implementation Planning High-level Design (or Architecture) Detailed Design Coding and Unit Testing (Debugging)

More information

Amanullah Dept. Computing and Technology Absayn University Peshawar Abdus Salam

Amanullah Dept. Computing and Technology Absayn University Peshawar Abdus Salam A Comparative Study for Software Cost Estimation Using COCOMO-II and Walston-Felix models Amanullah Dept. Computing and Technology Absayn University Peshawar scholar.amankhan@gmail.com Abdus Salam Dept.

More information

A Review of Agile Software Effort Estimation Methods

A Review of Agile Software Effort Estimation Methods A Review of Agile Software Effort Estimation Methods Samson Wanjala Munialo. Department of Information Technology Meru University of Science and Technology Meru - Kenya Geoffrey Muchiri Muketha Department

More information

Project Plan: MSE Portfolio Project Construction Phase

Project Plan: MSE Portfolio Project Construction Phase Project Plan: MSE Portfolio Project Construction Phase Plans are nothing; planning is everything. Dwight D. Eisenhower September 17, 2010 Prepared by Doug Smith Version 2.0 1 of 7 09/26/2010 8:42 PM Table

More information

Book Outline. Software Testing and Analysis: Process, Principles, and Techniques

Book Outline. Software Testing and Analysis: Process, Principles, and Techniques Book Outline Software Testing and Analysis: Process, Principles, and Techniques Mauro PezzèandMichalYoung Working Outline as of March 2000 Software test and analysis are essential techniques for producing

More information

Adapting software project estimation to the reality of changing development technologies

Adapting software project estimation to the reality of changing development technologies Adapting software project estimation to the reality of changing development technologies Introduction Estimating software projects where significant amounts of new technology are being used is a difficult

More information

COSOSIMO Parameter Definitions Jo Ann Lane University of Southern California Center for Software Engineering

COSOSIMO Parameter Definitions Jo Ann Lane University of Southern California Center for Software Engineering Constructive System-of-Systems Integration Cost Model COSOSIMO Parameter Definitions Jo Ann Lane University of Southern California Center for Software Engineering jolane@usc.edu Introduction The Constructive

More information

Project Management Framework with reference to PMBOK (PMI) July 01, 2009

Project Management Framework with reference to PMBOK (PMI) July 01, 2009 Project Management Framework with reference to PMBOK (PMI) July 01, 2009 Introduction Context Agenda Introduction to Methodologies What is a Methodology? Benefits of an Effective Methodology Methodology

More information

Enhancing Cost Estimation Models with Task Assignment Information

Enhancing Cost Estimation Models with Task Assignment Information Enhancing Cost Estimation Models with Task Assignment Information Joanne Hale Area of MIS Culverhouse College of Commerce and Business Administration The University of Alabama Tuscaloosa, AL 35487 jhale@cba.ua.edu

More information

CLASS/YEAR: II MCA SUB.CODE&NAME: MC7303, SOFTWARE ENGINEERING. 1. Define Software Engineering. Software Engineering: 2. What is a process Framework? Process Framework: UNIT-I 2MARKS QUESTIONS AND ANSWERS

More information

Unit-V Chapter-1 PROJECT CONTROL & PROCESS INSTRUMENTATION

Unit-V Chapter-1 PROJECT CONTROL & PROCESS INSTRUMENTATION Unit-V Chapter-1 PROJECT CONTROL & PROCESS INSTRUMENTATION INTERODUCTION: Software metrics are used to implement the activities and products of the software development process. Hence, the quality of the

More information