Elsevier Editorial System(tm) for Information and Software Technology Manuscript Draft


Manuscript Number: INFSOF-D
Title: A Controlled Experiment in Assessing and Estimating Software Maintenance Tasks
Article Type: Special issue article
Keywords: software maintenance; software estimation; maintenance experiment; COCOMO; maintenance size
Corresponding Author: Vu Nguyen
Corresponding Author's Institution: University of Southern California
First Author: Vu Nguyen, PhD
Order of Authors: Vu Nguyen, PhD; Barry Boehm, PhD; Phongphan Danphitsanuphan, Master's

A Controlled Experiment in Assessing and Estimating Software Maintenance Tasks

Vu Nguyen*,a, Barry Boehm a, Phongphan Danphitsanuphan b
a Computer Science Department, University of Southern California, Los Angeles, USA
b Computer Science Department, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand

Abstract

Context: Software maintenance is an important software engineering activity that has been reported to account for the majority of the total software cost. Thus, understanding the factors that influence the cost of software maintenance tasks helps maintainers make informed decisions about their work. Objective: This paper describes a controlled experiment of student programmers performing maintenance tasks on a C++ program. The objective of the study is to assess the maintenance size, effort, and effort distributions of three different maintenance types and to describe estimation models that predict the programmer's effort spent on maintenance tasks. Method: Twenty-three graduate students and a senior majoring in computer science participated in the experiment. Each student was asked to perform the maintenance tasks required for one of the three task groups. The impact of different LOC metrics on maintenance effort was also evaluated by fitting the collected data into four estimation models. Results: The results indicate that corrective maintenance is much less productive than enhancive and reductive maintenance and that program comprehension activities require as much as 50% of the total effort in corrective maintenance. Moreover, the best effort estimation model can estimate the time of 79% of the programmers with an error of 30% or less. Conclusions: Our study suggests that the LOC added, modified, and deleted metrics are good predictors for estimating the cost of software maintenance. Effort estimation models for maintenance work may use the LOC added, modified, and deleted metrics as independent parameters instead of the simple sum of the three. Another implication is that reducing business rules of the software requires a sizable proportion of the software maintenance effort. Finally, the differences in effort distribution among the maintenance types suggest that assigning maintenance tasks properly is important to effectively and efficiently utilize human resources.

Keywords: software maintenance; software estimation; maintenance experiment; COCOMO; maintenance size

1. Introduction

Software maintenance is crucial to ensuring the useful lifetime of software systems. According to previous studies [1][4][29], the majority of software-related work in organizations is devoted to maintaining existing software systems rather than building new ones. Despite advances in programming languages and software tools that have changed the nature of software maintenance, programmers still spend a significant amount of effort working with source code directly and manually. Thus, it remains an important challenge for the software engineering community to assess maintenance cost factors and develop techniques that allow programmers to accurately estimate their maintenance work. A typical approach to building estimation models is to determine which factors affect effort, and by how much, at different levels and then use these factors as input parameters in the models. For software maintenance, the modeling process is even more challenging.
The maintenance effort is affected by a large number of factors such as the size and type of maintenance work, personnel capabilities, the level of the programmer's familiarity with the system being maintained, the processes and standards in use, complexity, technologies, and the quality of the existing source code and its supporting documentation [5][18]. There has been tremendous effort in the software engineering community to study cost-driving factors and the amount of impact they have on maintenance effort [6][20]. A number of models have been proposed and applied in practice, such as [2][5][12]. Although maintenance size measured in source lines of code (LOC) is the most widely used factor in these models, there is a lack of agreement on what to include in the LOC metric. While some models determine the metric by summing the number of LOC added, modified, and deleted [2][21], others, such as [5], use only the LOC added and modified. Obviously, the latter assumes that the deleted LOC is not significantly correlated with maintenance effort. This inconsistency in using the size measure results in discrepancies in strategies proposed to improve software productivity and in problems comparing and converting estimates among estimation models.

* Corresponding author. E-mail addresses: nguyenvu@usc.edu (V. Nguyen), boehm@usc.edu (B. Boehm), phongphand@kmutnb.ac.th (P. Danphitsanuphan)

In this paper, we describe a controlled experiment of student programmers performing maintenance tasks on a small C++ program. The purpose of the study was to assess the size, effort, and labor-distribution implications of three different maintenance types and to describe estimation models that predict the programmer's effort on maintenance tasks. We focus the study on the enhancive, corrective, and reductive maintenance types according to the maintenance typology proposed by Chapin et al. [9]. We chose to study these maintenance types because they are the ones that change the business rules of the system by adding, modifying, and deleting source code. They are typically the most common activities of software maintenance. The results of our study suggest that corrective maintenance is less productive than enhancive and reductive maintenance. These results are largely consistent with the conclusions of previous studies [2][17]. The results further provide evidence about the effort distribution of maintenance tasks, in which program comprehension requires as much as 50% of the maintainer's total effort. In addition, our results on effort estimation models show that using the three separate LOC added, modified, and deleted metrics as independent variables in the model will likely result in higher estimation accuracies.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 provides a method for calculating the equivalent LOC of maintained programs. The experiment design and results are discussed in Sections 4 and 5. Section 6 describes models to estimate programmers' effort on maintenance tasks. Section 7 discusses the results. Section 8 discusses various threats to the validity of the research results, and the conclusions are given in Section 9.

2. Related Work

Many studies have been published that address different size- and effort-related issues of software maintenance and propose approaches to estimating the cost of software maintenance work. To help better understand and assess software maintenance work, Swanson [31] proposes a typology that classifies software maintenance into adaptive, corrective, and perfective maintenance types. This typology has become popular among researchers, and the IEEE has adapted these types in its Standard for Software Maintenance [19] along with an additional preventive maintenance type. In their proposed ontology of software maintenance, Kitchenham et al. [22] define two maintenance activity types, corrections and enhancements. The former type is equivalent to the corrective maintenance type, while the latter can be generally equated to the adaptive, perfective, and preventive maintenance types defined by Swanson and the IEEE. Chapin et al. [9] proposed a fine-grained classification of twelve types of software maintenance and evolution. These types are grouped into four clusters: support interface, documentation, software properties, and business rules, listed in order of their impact on the software. The last cluster, which consists of the reductive, corrective, and enhancive types, includes all activities that alter the business rules of the software. Chapin et al.'s classification does not have a clear analogy with the types defined by Swanson. As an exhaustive typology, however, it includes not only Swanson's and the IEEE's maintenance types but also other maintenance-related activities such as training and consulting.
Empirical evidence on the distribution of effort among maintenance activities helps estimate maintenance effort more accurately through the use of appropriate parameters for each type of maintenance activity and helps better allocate maintenance resources. It is also useful for determining effort estimates for maintenance activities that are performed by different maintenance providers. Basili et al. [2] report an empirical study to characterize the effort distribution among maintenance activities and provide a model to estimate the effort of software releases. Among the findings, isolation activities were found to consume a higher proportion of effort in error correction than in enhancement changes, but a much smaller proportion of effort was spent on inspection, certification, and consulting in error correction. The other activities, which include analysis, design, and code/unit test, were found to take virtually the same proportions of effort when comparing these two types of maintenance. Mattsson [25] describes a study on data collected from four consecutive versions of a six-year object-oriented application framework project. The study provides evolutionary trends in the relative effort distribution of four technical phases (analysis, design, implementation, and test) across the four versions of the project, showing that the proportion of implementation effort tends to decrease from the first version to the fourth, while the proportion of analysis effort follows the reverse trend. Similarly, Yang et al. [32] present results from an empirical study on the effort distribution of a series of nine projects delivering nine successive versions of a software product. All projects were maintenance projects except the first, which delivered the first version of the series. The coding activity was found to account for the largest proportion of effort (42.8%), while the requirements and design activities consumed only 10.2% and 14.5%, respectively. In addition to analyzing the correlation between maintenance size and productivity metrics and deriving effort estimation models for maintenance projects, De Lucia et al. [13] describe an empirical study on the effort distribution among five phases, namely inventory, analysis, design, implementation, and testing. The analyses were based on data obtained from a large Y2K project following the maintenance processes of a software organization. Their results show that the design phase is the most expensive, consuming about 38% of the total effort, while the analysis and implementation phases account for small proportions, about 11% each. These results are somewhat contrary to the results reported by Yang et al. [32]. A more recent study by the same authors (De Lucia et al.) presents estimation models and the distribution of effort from a different project in the same organization [14]. A number of studies have been reported that address issues related to characterizing size metrics and building cost estimation models for software maintenance. In his COCOMO model for software cost estimation, Boehm presents an approach to estimating the annual effort required to maintain a software product. The approach uses a factor named Annual Change Traffic (ACT) to adjust the maintenance effort based on the effort estimated or actually spent on developing the software [7].

ACT specifies the estimated fraction of LOC that undergoes change during a typical year. It includes source code addition and modification but excludes deletion. If sufficient information is available, the annual maintenance effort can be further adjusted by a maintenance effort adjustment factor computed as the product of predetermined effort multipliers. In a major extension, COCOMO II, the model introduces new formulas and additional parameters to compute the size of maintenance work and the size of reused and adapted modules [5]. The additional parameters take into account effects such as the complexity of the legacy code and the familiarity of programmers with the system. In a more recent model extension for estimating maintenance cost, Nguyen proposes a set of formulas that unifies the two COCOMO II reuse and maintenance sizing methods. The extension also takes into account the size of source code deletions and calibrates new rating scales of the cost drivers specific to software maintenance. Basili et al. [2], together with characterizing the effort distribution of maintenance releases, describe a simple regression model to estimate the effort needed to maintain and deliver a release. The model uses a single variable, LOC, which was measured as the sum of added, modified, and deleted LOC including comments and blanks. The prediction accuracy was not reported, although the coefficient of determination was relatively high (R^2 = 0.75), indicating that LOC is an important predictor of maintenance effort. Jorgensen evaluated eleven different models for estimating the effort of individual maintenance tasks using regression, neural network, and pattern recognition approaches [21]. The models use the size of maintenance tasks, also measured as the sum of added, updated, and deleted LOC, as the main size input. The best model could generate effort estimates within 25 percent of the actuals 26 percent of the time, and the mean magnitude of relative error (MMRE) was 100 percent. Several previous studies have proposed and evaluated models exclusively for estimating the effort required to implement corrective maintenance tasks. De Lucia et al. used multiple linear regression to build effort estimation models for corrective maintenance projects [12]. Three models were built using coarse-grained metrics, namely the number of tasks requiring source code modification (NA), the number of tasks requiring fixing of data misalignment (NB), the number of other tasks (NC), the total number of tasks, and the LOC of the system to be maintained. They evaluated the models on 144 observations, each corresponding to a one-month period, collected from five corrective maintenance projects in the same software services company. The best model, which includes all metrics, achieved effort estimates within 25 percent of the actuals … percent of the time and an MMRE of 32.25%. When compared with the non-linear model previously used by the same company, they suggested that the linear model using the same variables produces higher estimation accuracies. They also showed that taking into account the difference in types of corrective maintenance tasks can improve the performance of the estimation model.

3. Calculating Equivalent LOC

In software maintenance, the programmer works on the source code of the existing system. The delivered maintained software includes source lines of code reused, modified, added, and deleted from the existing system.
Moreover, the maintenance work is constrained by the existing architecture, design, implementation, and technologies used. These activities require extra time from maintainers to comprehend, test, and integrate the maintained pieces of code. Thus, an acceptable estimation model should take these characteristics of software maintenance into account through its estimation of either size or effort. In this experiment, we adapt the COCOMO II reuse model to determine the equivalent LOC of the maintenance tasks. The model involves determining the amount of software to be adapted, the percentage of design modified (DM), the percentage of code modified (CM), the percentage of integration and testing (IM), the degree of Assessment and Assimilation (AA), the understandability of the existing software (SU), and the programmer's unfamiliarity with the software (UNFM). The last two parameters directly account for the programmer's effort to comprehend the existing system. The equivalent LOC formula is defined as

Equivalent LOC = TRCF x AAM    (1)

AAF = S / TRCF

AAM = AAF x (1 + [1 - (1 - AAF)^2] x SU x UNFM),  for AAF <= 1
AAM = AAF + SU x UNFM,                            for AAF > 1

Where,
TRCF = the total LOC of the task-relevant code fragments, i.e., the portion of the program that the maintainers have to understand to perform their maintenance tasks.
S = the size in LOC.
SU = the software understandability. SU is measured as a percentage ranging from 10% to 50%.
UNFM = the level of programmer unfamiliarity with the program. The UNFM rating scale ranges from 0.0 to 1.0, or from Completely familiar to Completely unfamiliar.
Numeric values of SU and UNFM are given in Table 2 in Appendix A.
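For illustration, the following minimal Python sketch computes Equation (1) as reconstructed above; the function name and the example values (40 changed LOC within 200 LOC of task-relevant code) are illustrative assumptions, not data from the experiment.

```python
def equivalent_loc(s_changed, trcf, su, unfm):
    """Equivalent LOC per Equation (1), as reconstructed above.

    s_changed: size of the change in LOC (S)
    trcf:      total LOC of the task-relevant code fragments (TRCF)
    su:        software understandability, 0.10 to 0.50 (i.e., 10% to 50%)
    unfm:      programmer unfamiliarity, 0.0 (completely familiar) to 1.0
    """
    aaf = s_changed / trcf
    if aaf <= 1:
        aam = aaf * (1 + (1 - (1 - aaf) ** 2) * su * unfm)
    else:
        aam = aaf + su * unfm
    return trcf * aam

# Hypothetical example: 40 changed LOC inside 200 LOC of task-relevant code,
# Nominal understandability (30%) and a moderately unfamiliar programmer.
print(equivalent_loc(40, 200, 0.30, 0.6))  # about 42.6 equivalent LOC
```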

LOC is the measure of logical source statements (i.e., logical LOC) according to the COCOMO II LOC definition checklist given in [5] and further detailed in [26]. LOC does not include comments and blanks, and, more importantly, it counts the number of source statements regardless of how many lines a statement spans. TRCF is not a size measure of the whole program to be maintained. Instead, it reflects only the portions of the program's source code that are touched by the programmer. Ko et al. studied maintenance activities performed by students, finding that the programmers collected working sets of task-relevant code fragments, navigated dependencies, and edited the code within these fragments to complete the required tasks [23]. This as-needed strategy [24] does not require the maintainer to understand code segments that are not relevant to the task. Equation (1) reflects this strategy by including only task-relevant code fragments rather than the whole adapted program. The task-relevant code fragments are the functions and blocks of code that are affected by the changes.

4. Description of the Experiment

4.1. Hypotheses

According to Boehm [7], the programmer's maintenance activities consist of understanding maintenance task requirements, code comprehension, code modification, and unit testing. Although only the last two activities deal with source code directly, empirical studies have shown high correlations between the overall maintenance effort and the total LOC added, modified, and deleted (e.g., [2][21]). We hypothesize that these activities have comparable distributions of programmer's effort regardless of what types of changes are made. Indeed, with the same cost factors [5] such as program complexity and project and personnel attributes, the productivity of enhancive tasks is expected to show no difference from that of corrective and reductive maintenance. Thus, we have the following hypotheses:

Hypothesis 1: There is no difference in productivity among enhancive, corrective, and reductive maintenance.
Hypothesis 2: There is no difference in the division of effort across maintenance activities.

4.2. The Participants and Groups

We recruited 1 senior and 23 graduate computer-science students who were participating in our directed research projects. Participation in the experiment was voluntary, although we gave participants a small incentive by exempting them from the final assignment. By the time the experiment was carried out, all participants had been asked to compile and test the program as a part of their directed research work. However, according to our pre-experiment survey, their level of unfamiliarity with the program code (UNFM) varied from Completely unfamiliar to Completely familiar. We rated UNFM as Completely unfamiliar if the participant had not read the code and as Completely familiar if the participant had read and understood the source code and modified some parts of the program prior to the experiment. The performance of participants is affected by many factors such as programming skills, programming experience, and application knowledge [5][8]. We assessed the expected performance of participants through pre-experiment surveys and a review of participants' resumes. All participants claimed to have programming experience in C/C++ or Java or both, and 22 participants already had working experience in the software industry. On average, participants claimed to have 3.7 (±2) years of programming experience and 1.9 (±1.7) years of working experience in the software industry.
We ranked participants by their expected performance based on their C/C++ programming experience, industry experience, and level of familiarity with the program. We then carefully assigned participants to each group so that the performance capability among groups was balanced as much as possible. As a result, we had seven participants in the enhancive group, eight in the reductive group, and nine in the corrective group. We further discuss in Section 8 potential threats related to the group assignments, which may raise validity concerns about the results.

4.3. Procedure and Environment

Participants performed the maintenance tasks individually in two sessions in a software engineering lab. The two sessions had a total time limit of 7 hours, and participants were allowed to schedule their time to complete these sessions. If participants did not complete all tasks in the first session, they continued the second session on the same or a different day. Prior to the first session, participants were asked to complete a pre-experiment questionnaire on their understanding of the program and were then told how the experiment would be performed. Participants were given the original source code, a list of maintenance activities, and a timesheet form. Participants were required to record time on paper for every activity performed to complete the maintenance tasks. The time information includes start clock time, stop clock time, and interruption time measured in minutes. Participants used Visual Studio 2005 on Windows XP. The program's user manual was provided to help participants set up the working environment and compile the program.

Prior to completing the assignments, participants were given prepared acceptance test cases and were told to run these test cases to certify their updated program. These test cases covered the added, affected, and deleted capabilities of the program. Participants were also told to record all defects found during the acceptance test and not to fix or investigate these defects.

4.4. The UCC Program

UCC was a program that allowed users to count LOC-related metrics such as statements, comments, directive statements, and data declarations of a source program. It also allowed users to compare the differentials between two versions of a source program and determine the number of LOC added, modified, and deleted. The program was developed and distributed by the USC Center for Systems and Software Engineering. The UCC program had three main modules: (1) read input parameters and parse source code, (2) analyze, compare, and count source code, and (3) produce results to output files. The UCC program had 5,188 logical LOC and consisted of 20 C++ classes. The program was well structured and well commented, but parts of the program had relatively high coupling. Thus, the SU parameter was estimated to be Nominal, or a numeric value of 30%.

4.5. Maintenance Tasks

The maintenance tasks were divided into three groups, enhancive, reductive, and corrective, each being assigned to one participant group. These maintenance types fall into the business rules cluster of the typology proposed by Chapin et al. [9]. There were five maintenance tasks for the enhancive group and six for each of the other groups. The enhancive tasks required participants to add five new capabilities that allow the program to take an extra input parameter, check the validity of the input and notify users, count for and while statements, and display a progress indicator. Since these capabilities are located in multiple classes and methods, participants had to locate the appropriate code to add and possibly modify or delete the existing code. We expected that the majority of code would be added for the enhancive tasks unless participants had enough time to replace the existing code with a better version of their own. The reductive tasks asked for deleting six capabilities from the program. These capabilities involve handling an input parameter, counting blank lines, and generating a count summary for the output files. The reductive tasks emulate possible requests from customers who do not want to include certain capabilities in the program because of redundancy, performance issues, platform adaptation, etc. Similar to the enhancive tasks, participants needed to locate the appropriate code and delete lines of code, or possibly modify and add new code to meet the requirements. The corrective tasks called for fixing six capabilities that were not working as expected. Each task is equivalent to a user request to fix a defect of the program. Similar to the enhancive and reductive tasks, the corrective tasks handle input parameters, counting functionality, and output files. We designed these tasks in such a way that they required participants to mainly modify existing lines of code.

4.6. Metrics

The independent variable was the type of maintenance, consisting of the enhancive, corrective, and reductive types. The dependent variables were the programmer's effort and the size of change.
The programmer's effort was defined as the total time the programmer spent working on the maintenance tasks excluding interruption time; the size of change was measured by three LOC metrics: LOC added, modified, and deleted.

4.7. Maintenance Activities

We focus on the context of software maintenance in which the programmer performs quick fixes according to the customer's maintenance requests [3]. Upon receiving the maintenance request, the programmers validate the request and contact the submitter for clarifications if needed. They then investigate the program code to identify relevant code fragments, edit, and perform unit tests on the changes [2][23]. In the experiment, we grouped this scenario into four maintenance activities:
- Task comprehension: reading and understanding the task requirements and asking for further clarification.
- Isolation: locating and understanding the code segments to be adapted.
- Editing code: programming and debugging the affected code.
- Unit test: performing tests on the affected code.
Obviously, these activities do not include design modifications because small changes and enhancements hardly affect the system design. Indeed, since we focus on the maintenance quick fix, the maintenance request often does not affect the existing design. Integration test activities are also not included, as the program is by itself the only component, and we performed acceptance testing independently to certify the completion of tasks.

5. Results

In this section, we provide the results of our experiment and the analysis and interpretation of the results. We use the one-sided Mann-Whitney U test with the typical 0.05 level of significance to test for statistically significant differences between two sets of values. We also perform the Kruskal-Wallis test to validate the differences among the groups.

5.1. Data Collection

Data was collected from three different sources: surveys, timesheets, and the source code changes. From the surveys, we determined participants' programming skills; programming language, platform, and industry experience; and level of unfamiliarity with the program. Maintenance time was calculated as the duration between finish and start time, excluding interruption time if any. The resulting timesheet had a total of 490 records totaling 4,621 minutes. On average, each participant recorded 19.6 activities with a total of 192.5 minutes, or 3.2 hours. We did not include the acceptance test effort because it was done independently after the participants completed and submitted their work. Indeed, in a real-world situation the acceptance test is usually performed by customers or an independent team, and their effort is often not recorded as effort spent by the maintenance team. The sizes of changes were collected in terms of the number of LOC added, modified, and deleted by comparing the original with the modified version. These LOC values were then adjusted using the sizing method described in Section 3 to obtain equivalent LOC. We measured the LOC of task-relevant code fragments (TRCF) by summing the size of all affected methods. As one LOC corresponds to one logical source statement, one LOC modified can easily be distinguished from a combination of one added and one deleted.

5.2. Task Completion

We did not expect participants to complete all the tasks, given their capability and availability. Because the tasks were independent, we were able to identify which source code changes are associated with each task. Six participants spent time on incomplete tasks, with a total of 849 minutes or 18% of the total time. On average, the enhancive group spent the most time, 26%, on incomplete tasks compared with 16% and 12% by the corrective and reductive groups, respectively. The number of tasks completed by participants in the enhancive group was the lowest, 69%, while higher task completion rates were achieved by participants in the other groups, 96% in the reductive and 98% in the corrective group. Thereafter, we exclude the time and size associated with incomplete tasks because the time spent on these tasks did not actually produce any result meeting the task requirements.

Figure 1. Effort distribution of the four maintenance activities for each group and overall

5.3. Distribution of Effort

The first three charts in Figure 1 show the distribution of effort across the four activities for participants in each group. The fourth chart shows the overall distribution of effort combining all three groups. Participants spent the largest proportion of time on coding, and they spent much more time on the isolation activity than on testing. By comparing the distribution of effort among the groups, we can see that the proportions of effort spent on the maintenance activities vary vastly among the three groups. The task comprehension activity required the smallest proportions of effort. The corrective group spent the largest share of time on code isolation, twice as much as the enhancive group, while the reductive group spent much more time on unit test compared with the other groups. That is, updating or deleting existing program capabilities requires a high proportion of effort for isolating the code, while adding new program capabilities needs a large majority of effort for editing code. The enhancive group spent 53% of total time on editing, twice as much as that spent by the other groups. At the same time, the corrective group needed 51% of total time for program comprehension related activities, including task comprehension and code isolation. Participants in the enhancive group spent less time on the isolation activity but more time on writing code, while participants in the corrective group did the opposite. Moreover, the sums of the percentages of the coding and code isolation activities of these groups are almost the same (72% and 73%). The Kruskal-Wallis rank-sum tests confirm the differences in the percentage distributions of editing code (p = ) and code isolation (p = ) among these groups. Based on these test results, we can therefore reject Hypothesis 2.

5.4. Productivity

Figure 2 shows that, on average, the enhancive group produced almost 1.5 times as many LOC as the reductive group and almost 4 times as many LOC as the corrective group. Participants in the enhancive group focused on adding new LOC, the reductive group on deleting existing LOC, and the corrective group on modifying LOC. As a result, the enhancive group has the highest number of LOC added while no LOC was deleted; the reductive group has the highest number of LOC deleted while no LOC was added; and the corrective group has few LOC added and deleted. This pattern was dictated by our task design: the enhancive tasks require participants to mainly add new capabilities, which results in new code; the corrective tasks require modifying existing code; and the reductive tasks require deleting existing code. For example, participants in the reductive group modified 20% and deleted the remaining 80% of the total affected LOC. The box plots shown in Figure 3 provide the 1st quartile, median, 3rd quartile, and the outliers of the productivity for the three groups. The productivity is defined as the sum of equivalent LOC added, modified, and deleted divided by the total effort measured in person-hours. According to the sizing method defined in Equation (1), this productivity measure accounts for the effects of software understanding. One participant had a much higher productivity than any other participant. A closer look at this data point reveals that this participant had 8 years of industry experience and was working as a full-time software engineer at the time of the experiment.
Figure 2. Average equivalent LOC added, modified, and deleted in the three groups

Figure 3. Participants' productivity (equivalent LOC per person-hour)

As indicated in the box plots, the productivity of the corrective group is much lower than that of the other groups. On average, for each hour participants in the corrective group produced 8 (±1.7) LOC, which is 0.4 times as many as the reductive and enhancive groups produced, 20 (±8.4) and 21 (±11.3), respectively. One-sided Mann-Whitney U tests confirm these productivity differences (p = for the difference between the enhancive and corrective groups and p = between the reductive and corrective groups), and there is a lack of statistical evidence to indicate a productivity difference between the enhancive and reductive groups (p = 0.45). The Kruskal-Wallis rank-sum test also indicates a statistically significant difference in productivity among these groups (p = ). Hypothesis 1 is therefore rejected.

6. Explanatory Maintenance Effort Models

Understanding the factors that influence maintenance cost and predicting future cost is of great interest to software engineering practitioners. Reliable estimates enable practitioners to make informed decisions and ensure the success of software maintenance projects. With the data obtained from the experiment, we are interested in deriving models to explain and predict the time spent by each participant on the maintenance tasks.

6.1. Models

Previous studies have identified numerous factors that affect the cost of maintenance work. These factors reflect the characteristics of the platform, program, product, and personnel of maintenance work [5][8]. In the context of this experiment, personnel factors are most relevant. Other factors are relatively invariant, hence irrelevant, because participants performed the maintenance tasks in the same environment, on the same product, and with the same working set. Therefore, in this section we examine models that use only factors relevant to the context of this experiment. The Effort Adjustment Factor (EAF) is the product of the effort multipliers defined in the COCOMO II model, representing the overall effects of the model's multiplicative factors on effort. In this experiment, we define EAF as the product of programmer capability (PCAP), language and tool experience (LTEX), and platform experience (PLEX). We used the same rating values for these cost drivers as defined in the COCOMO II Post-Architecture model. We rated the PCAP, LTEX, and PLEX values based on the participant's GPA, experience, pre-test, and post-test scores. The numeric values of these parameters are given in Appendix A. If a rating fell in between two defined rating levels, we divided the scale into finer intervals by using a linear interpolation from the defined values of the two adjacent rating levels. This technique allowed specifying more precise ratings for the cost drivers. We will investigate the following models:

M1: E = β0 + β1 * S1 * EAF
M2: E = β0 + (β1 * Add + β2 * Mod + β3 * Del) * EAF
M3: E = β0 + β1 * S2 * EAF
M4: E = β0 + (β1 * Add + β2 * Mod) * EAF

Where, E is the total minutes that the participant spent on completed maintenance tasks. Add, Mod, and Del represent the number of LOC added, modified, and deleted by the participant for all completed maintenance tasks, respectively.

S1 is the total equivalent LOC that was added, modified, and deleted by the participant, that is, S1 = Add + Mod + Del. S2 is the total equivalent LOC that was added and modified, or S2 = Add + Mod. EAF is the effort adjustment factor described above. As we can see in the model equations, the LOC metrics Add, Mod, and Del are all adjusted by EAF, taking into account the capability and experience of the participant. Models M3 and M4 differ from models M1 and M2 in that they do not include the variable Del. Thus, differences in the performance of models M3 and M4 versus models M1 and M2 will reflect the effect of the deleted LOC metric. The estimates of the coefficients in model M2 determine how this model differs from model M1. This difference is subtle but significant because M2 accounts for the impact of each type of LOC metric on maintenance effort. In the next two subsections, we will estimate the coefficients of the models using the experiment data and evaluate the performance as well as the structural differences among them.

6.2. Model Performance Measures

We use the two most prevalent model performance measures, MMRE and PRED, as criteria to evaluate the accuracy of the models [30]. These metrics are derived from the basic magnitude of relative error (MRE), which is defined as

MRE_i = |y_i - ŷ_i| / y_i

Where y_i and ŷ_i are the actual and estimated values of the ith observation, respectively. The mean MRE of N estimates is defined as

MMRE = (1/N) Σ_{i=1..N} MRE_i

Clearly, according to this formula, extreme relative errors can have a significant impact on MMRE, affecting the overall conclusion about the performance of the model under evaluation. To overcome this problem, PRED is often used as an important complementary measure. PRED(l) is defined as the percentage of estimates whose MRE is not greater than l, that is, PRED(l) = k/N, where k is the number of estimates with MRE values falling between 0 and l. PRED values range from 0 to 1. High-performance models are expected to have high PRED and small MMRE. Conte et al. [11] proposed PRED(0.25) >= 0.75 and MMRE <= 0.25 as standard acceptance levels for effort estimation models. In this study, we chose to report MMRE, PRED(0.25), and PRED(0.3) values because they have been the most widely used measures for evaluating the performance of software estimation models [15][27][30]. In addition, we use the coefficient of determination (R^2) as a criterion to evaluate the explanatory power of the variables used in the models.

6.3. Results

We collected a total of 24 data points, each having LOC added (Add), modified (Mod), deleted (Del), actual effort (E), and effort adjustment factor (EAF). Fitting the 24 data points (see Table 3 in Appendix B) to models M1, M2, M3, and M4 using least squares regression, we obtained

M1: E = β0 + 2.2 * S1 * EAF
M2: E = 43.9 + (2.8 * Add + 5.3 * Mod + 1.3 * Del) * EAF
M3: E = 110 + 2.2 * S2 * EAF
M4: E = 79.1 + (2.3 * Add + β2 * Mod) * EAF

Table 1. Summary of results obtained from fitting the models

Metric        M1                M2                M3                M4
R^2           0.50              0.75              0.55              0.64
β0            (p = 10^-3)       43.9 (p = 0.06)   110 (p = 10^-7)   79.1 (p = 4.8*10^-4)
β1            2.2 (p = 10^-4)   2.8 (p = 10^-7)   2.2 (p = 10^-5)   2.3 (p = 10^-6)
β2            -                 5.3 (p = 10^-5)   -                 (p = 2.7*10^-4)
β3            -                 1.3 (p = 0.02)    -                 -
MMRE          33%               20%               28%               27%
PRED(0.3)     58%               79%               75%               79%
PRED(0.25)    46%               71%               75%               71%
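To make the fitting and evaluation procedure concrete, the following minimal Python sketch fits model M2 by ordinary least squares and computes MMRE and PRED(0.3); the data arrays are illustrative placeholders, not the 24 observations of Table 3 in Appendix B.

```python
import numpy as np

# Placeholder observations (illustrative only): effort E in minutes, LOC
# added/modified/deleted on completed tasks, and EAF for each participant.
E    = np.array([120.0,  95.0, 210.0, 160.0, 140.0])
add  = np.array([ 60.0,  10.0,  80.0,  30.0,  20.0])
mod  = np.array([  5.0,  15.0,  10.0,  25.0,  12.0])
dele = np.array([  0.0,  40.0,   5.0,  20.0,  30.0])
eaf  = np.array([  1.0,   0.9,   1.2,   1.1,   1.0])

# Design matrix for M2: E = b0 + (b1*Add + b2*Mod + b3*Del) * EAF
X = np.column_stack([np.ones_like(E), add * eaf, mod * eaf, dele * eaf])
beta, *_ = np.linalg.lstsq(X, E, rcond=None)
estimates = X @ beta

# MRE_i = |y_i - yhat_i| / y_i; MMRE is the mean MRE; PRED(l) is the
# fraction of estimates whose MRE does not exceed l.
mre = np.abs(E - estimates) / E
mmre = mre.mean()
pred_30 = (mre <= 0.30).mean()
print(beta, mmre, pred_30)
```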

Table 1 shows the statistics obtained from the four models. The p-values are shown next to the estimates of the coefficients. In all models, the estimates of all coefficients except β0 in M2 are statistically significant (p <= 0.05). It is important to note that β1, β2, and β3 in model M2 are the estimated coefficients of the Add, Mod, and Del variables, respectively. They reflect the variances in the productivity of the three maintenance types discussed above. These estimates show that the Add, Mod, and Del variables have significantly different impacts on the effort estimate of M2. One modified LOC affects effort as much as two added or four deleted LOC. That is, modifying one LOC is much more expensive than adding or deleting it. As shown in Table 1, although Del has the least impact on effort compared to Add and Mod, it is statistically correlated with effort (p = 0.02). Thus, it is implausible to ignore the effect of the deleted LOC on maintenance effort. Models M1 and M3, which both use a single combined size parameter, have the same slope (β1 = 2.2), indicating that the size parameters S1 and S2 have the same impact on effort. The estimates of the intercept (β0) in the models indicate the average overhead of the participant's maintenance tasks. The overhead seems to come from non-coding activities such as task comprehension and unit test, which do not result in any changes to the source code. Model M3 has the highest overhead (110 minutes), which seems to compensate for the absence of the deleted LOC in the model. The coefficient of determination (R^2) values suggest that 75% of the variability in effort is explained by the variables in M2, while only 50%, 55%, and 64% is explained by the variables in M1, M3, and M4, respectively. It is interesting to note that both models M3 and M4, which do not include the deleted LOC, generated higher R^2 values than did model M1. Moreover, the R^2 values obtained by models M2 and M4 are higher than those of models M1 and M3, which use a single combined size metric.

Figure 4. The MRE values obtained by the models in estimating the effort spent by 24 participants

The MMRE, PRED(0.3), and PRED(0.25) values indicate that M2 is the best performer, and it outperforms M1, the worst performer, by a wide margin. Model M2 produced estimates with a lower error average (MMRE = 20%) than did M1 (MMRE = 33%). For model M2, seventy-nine percent of the estimates (19 out of 24) have MRE values of less than or equal to 30%. In other words, the model produced effort estimates that are within 30 percent of the actuals 79 percent of the time. Both models M3 and M1 use a single size parameter, but M3 outperforms M1 on all of the model performance measures. PRED(0.25) of M3 is much higher than that of M1, and as we can see in Figure 4, the MRE values of M3 are less than those of M1 for most of the estimates. It is clear that the Del component in M1 negatively affects the performance of the model. On the contrary, the Del component in M2 contributes to improving the performance. As indicated in Table 1, model M2 produced more accurate estimates than did model M4, noting that the size parameters in both M2 and M4 are separated. These models outperform the combined size parameter models M1 and M3, indicating that the improvement in performance results from using the separate LOC metrics as independent variables.
As shown in Figure 4, all models produced low estimation accuracies when estimating the time of participant 13, with MRE values of 0.93 and higher. A closer examination of this data point reveals that all of the models overestimated

the participant's time and that the participant's capability and experience were estimated to be low, but his productivity was high. In fact, the participant, who worked in the enhancive group, spent a total of 200 minutes on the tasks but completed only two tasks, which took him 57 minutes. We suspect that he may have had experience in resolving these specific tasks while struggling to resolve the others.

7. Discussion

Although participants in the enhancive group completed the smallest proportion of tasks, only 69%, they were the most productive. Looking at the distribution of effort shown in Figure 1, we see that they spent the majority of their time on coding. As a result, more code was produced and higher productivity was achieved. This observation suggests that one must be cautious about the interpretation of the productivity measure and about which size metric is used to derive it. Moreover, our results imply that productivity should be interpreted in the context of what types of maintenance activities are performed. For example, the productivity of a programmer who performs functional enhancements should not be used as an indicator to evaluate the performance of another programmer who fixes software faults. Understanding the distribution of effort for different types of maintenance tasks allows managers to allocate appropriate resources and make reasonable plans for maintenance work. Our results show that the distributions of effort differ across maintenance types. These results are not fully consistent with the results of Basili et al.'s study, in which the proportions of effort spent on design and code/unit test were found to be almost the same [2]. In our study, the proportions of unit test effort in the enhancive and corrective maintenance types are almost identical, while the coding activity consumed a much larger proportion of effort in the enhancive maintenance type. However, it is worth noting that we used a different set of maintenance activities that cannot be fully mapped to their maintenance activities. Under the context of our experiment, participants did not perform the analysis, design, and inspection activities that were reported in their study. Nonetheless, we found that the program comprehension activities, which include task comprehension and code isolation, of the corrective and reductive groups require as much as 50% of the effort, which is largely consistent with the results reported by Basili et al. [2] and Fjelstad and Hamlen [16]. As discussed above, some studies use the sum of LOC added, modified, and deleted as a single size measure, while others use the sum of only LOC added and modified. In this study, we evaluated both methods. Surprisingly, our results suggest that excluding the deleted LOC from the sum likely gives a better size metric for predicting maintenance effort (see Table 1). However, both of these methods were shown to be inferior to the ones that use the LOC metrics as independent variables. As shown in Table 1, models M2 and M4, which use the LOC metrics as independent variables, outperform both M1 and M3. It can be inferred that each LOC metric has a different impact on maintenance effort. Thus, each LOC metric should be adjusted by a factor to derive a better size metric. The deleted LOC metric was found to be a statistically significant parameter for estimating maintenance effort. Including this metric in a model that uses the independent LOC metrics likely improves the performance of the model.
In the case where the simple sum is used, however, excluding the deleted LOC from the sum seems to generate more favorable models. This seemingly contradictory result needs further investigation and validation. In our study, the deleted LOC is the total number of LOC deleted only from modified modules. Deleting source code in a modified module requires detailed understanding of the task-relevant code fragments. On the other hand, if a whole module is deleted, the programmers may not need to acquire detailed understanding of the module. They may instead understand the code fragments that reference the module to be deleted and modify those code fragments appropriately. As a result, deleting a whole module would require much less effort than deleting the same number of LOC from a modified module.

8. Threats to Validity

Threats to Internal Validity

There are several threats to internal validity. The capabilities of the groups may differ significantly. We used a matched within-subjects design [10] to assign participants to groups, which helped to reduce differences. In addition, we scored participants' answers on our pre- and post-experiment questionnaires about participants' C++ experience and understanding of the program. We performed t-tests to test the differences in scores among the groups. The results indicated no statistically significant difference between any two groups (p > 0.23). Another threat is the accuracy of the time logs recorded by participants. We told participants to record start, end, and interruption time for each maintenance activity. This required participants to record their time consistently from one activity to another. In addition, we used a hardcopy timesheet instead of a softcopy one, as we believe that it is more difficult to manipulate the time in the hardcopy, and if manipulations were made, we could identify them easily. The time data was found to be highly reliable. A third threat concerns possible differences in the complexity of the maintenance tasks. As complexity is one of the key factors that significantly affects the productivity of maintenance tasks, the differences may cause the productivity to be