Chapter 2. Literature Review


As we mentioned previously, before machine learning techniques were applied to the estimation domain, most estimation models were analytic-based. They estimate cost through mathematical formulas, but they either consider too few factors or include too many subjective ones. Both shortcomings cause the estimates to exhibit a very high Mean Absolute Relative Error (MARE). After machine learning techniques arose and developed, these problems of the analytic-based models were addressed. We give complete descriptions of both families of models in the following sections.

2.1 Analytic-Based Models

The analytic-based models take only lines of code (LOC) or thousands of lines of code (KLOC) into account as the size metric of a software project. Some historical project data are listed in Table 2-1.

Table 2-1. Example of historical project data

Project   Person-Months   $K    KLOC   Pag. of Doc.   Num. of Error   Num. of Employee
01        24              168   12.1   365            29              3
...       ...             ...   ...    ...            ...             ...

In Table 2-1, for example, project 01 needs 24 person-months to complete, its cost is 168,000 dollars, its size is 12.1 KLOC, its documentation comprises 365 pages, its total number of errors is 29, and 3 employees are involved. According to this information, the size-based models can estimate productivity, quality, expense, and documentation with the following equations:

Productivity = KLOC / Person-Months
Quality = Number of Errors / KLOC
Expense = $K / KLOC
Documentation = Pages of Document / KLOC
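As a minimal illustration of these size-based metrics, the sketch below computes them for project 01 of Table 2-1; the field names are our own, and the numbers come from the example above.

```python
# Minimal sketch: size-based metrics for project 01 of Table 2-1.
# Field names are illustrative; values are taken from the example above.
project_01 = {
    "person_months": 24,
    "cost_k_dollars": 168,   # $K
    "kloc": 12.1,
    "pages_of_doc": 365,
    "num_errors": 29,
}

productivity  = project_01["kloc"] / project_01["person_months"]    # KLOC per person-month
quality       = project_01["num_errors"] / project_01["kloc"]       # errors per KLOC
expense       = project_01["cost_k_dollars"] / project_01["kloc"]   # $K per KLOC
documentation = project_01["pages_of_doc"] / project_01["kloc"]     # pages per KLOC

print(f"Productivity:  {productivity:.2f} KLOC/PM")
print(f"Quality:       {quality:.2f} errors/KLOC")
print(f"Expense:       {expense:.2f} $K/KLOC")
print(f"Documentation: {documentation:.2f} pages/KLOC")
```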

Metrics based upon LOC or KLOC are still disputed, especially because the LOC or KLOC figure is itself decided subjectively. Building on the idea of LOC, models such as COCOMO and Function Point consider more of the factors that determine the effort or duration required to complete a software project. Next, we describe them in detail.

COCOMO

The latest version of COCOMO is COCOMO II, which has been implemented in the Costar 6.0 tool. We use its estimation mechanism to explain the principle of the original COCOMO. COCOMO II primarily estimates the required effort (measured in person-months) based on the manager's estimate of the software project's size, i.e., LOC. The effort is predicted by the following equation:

Effort = 2.94 * EAF * (KLOC)^E

where EAF is the Effort Adjustment Factor derived from the cost drivers, and E is an exponent derived from the five scale drivers.

Scale Drivers

In the COCOMO II model, some of the most important factors affecting the duration and cost of a software project are called scale drivers. There are five scale drivers in the model: Precedentedness, Development Flexibility, Architecture / Risk Resolution, Team Cohesion, and Process Maturity.

Cost Drivers

The cost drivers are multiplicative factors that determine the effort required to complete the software project. For example, if a software project is to control an airplane in flight, then we can set the Required Software Reliability (RELY) cost driver to Very High. The original COCOMO contains fifteen cost drivers; COCOMO II expands them to seventeen. The complete list of cost drivers is given in Appendix A.

COCOMO II Effort Equation

As an example, a project with all Nominal cost drivers and scale drivers would have an EAF of 1.00 and an exponent, E, of 1.0997. Assuming that the project is projected to consist of 8,000 source lines of code, COCOMO II estimates that 28.9 person-months of effort are required to complete it:

Effort = 2.94 * (1.00) * (8)^1.0997 = 28.9 person-months

Effort Adjustment Factor

Following the idea above, the Effort Adjustment Factor (EAF) in the effort equation is simply the product of the effort multipliers corresponding to each of the cost drivers for the project. For example, if the project is rated Very High for Complexity (effort multiplier of 1.34), Low for Language & Tool Experience (effort multiplier of 1.09), and all of the other cost drivers are rated Nominal (effort multiplier of 1.00), the EAF is the product of 1.34 and 1.09:

EAF = 1.34 * 1.09 = 1.46
Effort = 2.94 * (1.46) * (8)^1.0997 = 42.3 person-months

COCOMO II Schedule Equation

In addition, COCOMO II has a schedule equation to estimate the duration of the software project. The principle of this equation is the same as that of the effort equation. The schedule equation is listed below:

Duration = 3.67 * (Effort)^SE

where Effort is the result of the effort equation and SE is the schedule equation exponent derived from the five scale drivers.
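To make the two worked examples above concrete, here is a minimal sketch of the COCOMO II effort and schedule equations. The constants 2.94 and 3.67 and the exponent 1.0997 come from the text; the schedule exponent SE used below is only a placeholder, since its derivation from the scale drivers is not detailed here.

```python
def cocomo2_effort(kloc: float, eaf: float = 1.0, e: float = 1.0997) -> float:
    """COCOMO II effort equation: Effort = 2.94 * EAF * KLOC^E (person-months)."""
    return 2.94 * eaf * (kloc ** e)

def cocomo2_duration(effort_pm: float, se: float = 0.3179) -> float:
    """COCOMO II schedule equation: Duration = 3.67 * Effort^SE (months).
    SE here is only a placeholder value; in COCOMO II it is derived from the scale drivers."""
    return 3.67 * (effort_pm ** se)

# Nominal project of 8 KLOC: reproduces the 28.9 person-month example.
print(round(cocomo2_effort(8.0), 1))                    # ~28.9

# Same project rated Very High for Complexity (1.34) and Low for
# Language & Tool Experience (1.09): EAF = 1.34 * 1.09 = 1.46.
print(round(cocomo2_effort(8.0, eaf=1.34 * 1.09), 1))   # ~42.3

# Estimated duration for the nominal case (placeholder SE).
print(round(cocomo2_duration(cocomo2_effort(8.0)), 1))
```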

Function Point

The Function Point (FP) model was proposed by Albrecht in 1979 [Matson, 94]. The model was intended for business application systems, and it is not suitable for control systems. By computing function points, managers can obtain the corresponding cost that falls within a certain range of function points. The FP model decomposes a system into several function types, namely external inputs (EI), external outputs (EO), logical internal files (LIF), external interfaces (EF), and external inquiries (EQ), as shown in Table 2-2.

Table 2-2. Function Point Measurement

Function   Number   Weight (Simple / Medium / Complex)   UFP
EI
EO
EQ
LIF
EF

Then, we can calculate the total function points of a software project with the following equation:

FP = Total UFP * CAF

where UFP means the unadjusted function points. Calculating UFP takes three steps. First, count the number of each function type (EI, EO, and so on). Secondly, subjectively assign a weight to each function type. Thirdly, calculate the sum of UFP; i.e., sum up each function type's score by multiplying its count by the corresponding weight. For example, if EI = 2, EO = 3, EQ = 10, LIF = 2, EF = 1, and their corresponding weights are 3, 4, 3, 10, and 7, then we obtain the total UFP:

Total UFP = 2*3 + 3*4 + 10*3 + 2*10 + 1*7 = 75

After we obtain the total UFP, we next calculate CAF (the complexity adjustment factor), which is expressed in the following form:

CAF = 0.65 + 0.01 * SUM(Fi)

where 0.65 and 0.01 are empirically determined constants. Each Fi is obtained by answering one of fourteen questions called degrees of influence (DI). The questions cover data communications, distributed processing, performance objectives, configuration load, transaction rate, on-line data entry, end-user efficiency, on-line update, complex processing, reusability, installation ease, operational ease, multiple sites, and facilitation of change. Each DI takes a value from 0 (no influence) to 5 (strongest influence). The calculated FP value is then mapped to a cost range defined by the user; for instance, an FP of 1000 might correspond to a cost falling into a range starting at $20,000.
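The following sketch walks through the FP computation just described, using the example counts and weights from the text; the fourteen DI ratings are hypothetical values chosen only to illustrate the CAF adjustment.

```python
# Example function counts and (subjectively assigned) weights from the text.
counts  = {"EI": 2, "EO": 3, "EQ": 10, "LIF": 2, "EF": 1}
weights = {"EI": 3, "EO": 4, "EQ": 3,  "LIF": 10, "EF": 7}

total_ufp = sum(counts[f] * weights[f] for f in counts)   # 2*3 + 3*4 + 10*3 + 2*10 + 1*7 = 75

# Fourteen degrees of influence (DI), each rated 0 (none) to 5 (strongest).
# These ratings are hypothetical, purely for illustration.
di_ratings = [3, 2, 4, 1, 3, 5, 2, 3, 4, 2, 1, 2, 3, 0]   # SUM(Fi) = 35

caf = 0.65 + 0.01 * sum(di_ratings)                       # 0.65 + 0.35 = 1.00
fp  = total_ufp * caf

print(f"Total UFP = {total_ufp}, CAF = {caf:.2f}, FP = {fp:.1f}")
```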

The FP model has some disadvantages as well. First, the weights are assigned subjectively, which is its most disputed aspect. Secondly, it is suitable only for business applications. Moreover, using the analytic-based models raises further problems:

1. The definition of LOC is ambiguous: for instance, if a programmer breaks a statement into two lines for readability, how do we count it? Furthermore, if we encounter a nested structure, counting lines of code becomes even more difficult. Thus, the LOC approach causes a large difference between estimated and actual effort.

2. It is difficult to compare different languages: the LOC of a high-level language may be lower than that of a low-level language, yet the high-level language packs more complexity into each line. Therefore, it is difficult to compare two different development kits on equal terms.

3. Too much emphasis is placed on LOC: in addition to the lines of code, there are other objective factors that must be considered. For example, the type of database and the tools used might influence the project with different weights.

4. Subjective factors: when using function points, a manager has to subjectively assign a weight to each function and a rating to each DI. Such an approach leads to inconsistent results when different managers perform the estimation. In addition, the constant values (0.65 and 0.01) were derived from experience, and they may not be suitable for later software development environments [Kitchenham 97].

The same problems also apply to COCOMO. The constants in its equations were calibrated from historical data rather than from the domain where the model will be applied, so their validity is doubtful. The same holds for the drivers, which are set entirely according to the personal opinion of the estimator. For these reasons, the estimates produced by COCOMO are difficult to accept with confidence.

2.2 Machine Learning Based Models

The analytic models, such as COCOMO and Function Point, have been unable to demonstrate consistently adequate results, with errors of 100% or greater [Schofield, 98]. One possible reason these models have not proven fruitful is that they are often unable to adequately model the complex relationships that are apparent in software development. Moreover, although the models can be successful in a well-constrained environment, they are apparently not flexible enough to handle other domains. Recently, researchers have turned their attention to machine learning techniques, such as the ANN. The techniques related to our model include the K-means method, ANN, and CBR. Next, we discuss their principles, related bottlenecks, and applications to cost estimation, along with other machine learning technologies.

Artificial Neural Networks

The concept of the ANN is inspired by the architecture of biological neural networks. Within an ANN, simple neurons are interconnected and organized into three parts: an input layer, hidden layers, and an output layer. Every neuron applies some function, such as the sigmoid, to its inputs and passes on the result once it satisfies some threshold; both inputs and outputs must lie in the range from 0 to 1. A neuron's output then becomes the input of other neurons in the network, and the process proceeds until one or more results are generated at the output-layer neurons. The connection weights are initialized randomly; afterward the network learns the relationships implicit between inputs and outputs through a learning algorithm, commonly backpropagation, to improve its predictive accuracy. More detailed discussion can be found in [Michael 97, Venkatachalam 93, Wittig et al. 97].

Although the ANN performs best in terms of prediction accuracy among the methods proposed in the machine learning area [Carolyn 99], its explanatory ability is the lowest. From the viewpoint of managers, the reasons behind a decision are important, so this disadvantage makes it difficult for the project manager to explain what the model has done. Furthermore, the ANN requires inputs in the range from 0 to 1, and this additional transformation wastes disk space and CPU time. Thirdly, the network may converge prematurely to an inferior solution; there is no guarantee that the solution provides the best model of the data.
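As a minimal sketch of the mechanism described above, the following code performs a forward pass through a tiny fully connected network with sigmoid activations. The weights are random placeholders and no backpropagation step is shown; the layer sizes and input features are purely illustrative.

```python
import math
import random

def sigmoid(x: float) -> float:
    """Squashes any input into the (0, 1) range, as required for ANN inputs/outputs."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Propagate inputs through each layer; layers is a list of weight matrices."""
    activations = inputs
    for weights in layers:
        activations = [sigmoid(sum(w * a for w, a in zip(row, activations)))
                       for row in weights]
    return activations

# A toy 3-input, 4-hidden, 1-output network with randomly initialized weights
# (in a real model, backpropagation would adjust these from historical projects).
random.seed(0)
hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
output = [[random.uniform(-1, 1) for _ in range(4)]]

# Project features, already scaled into [0, 1] (e.g. normalized KLOC, team size, ...).
features = [0.6, 0.2, 0.9]
print(forward(features, [hidden, output]))   # estimated effort, also scaled into (0, 1)
```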

Case-Based Reasoning

Overview of Case-Based Reasoning

Case-based reasoning is a problem solving approach that has received a great deal of attention. It has its origins in work such as Schank's theory of dynamic memory and the role of previous situations (or cases) in learning and problem solving. Development of the first true case-based reasoning system, CYRUS, is attributed to Kolodner, who used the work of Schank in a basic question-and-answer system that held knowledge of the various meetings of former US Secretary of State Cyrus Vance [Schofield, 98]. Since then, more accomplished systems have been applied to different domains, for example dispute resolution, speech recognition, medical diagnosis, and Chinese cooking.

In case-based reasoning terminology, a case represents a problem that has been solved using a particular problem solving mechanism. Every case contains a description of the problem and the corresponding solution(s). Once a new case arises, case-based reasoning uses the solutions of the cases retrieved from the casebase as the basis for prediction. The reasoning procedure involves four steps, namely the retrieval, reuse, revision, and retention of explicit cases. The concept is drawn in Figure 2-1.

Figure 2-1. The CBR cycle [Aamodt 94]

A new problem is described as a case by some attributes. Next, it is compared to the existing cases in the casebase and the most similar cases are retrieved. These cases are combined and reused (i.e., adapted) to suggest a solution for the new problem. The solution may be revised according to the current environment if it is not validated. Finally, the revised solution is retained by adding it to the casebase for future use.

From the viewpoint of managers, case-based reasoning is easier to use. Its procedure is transparent and easy to understand, and it helps the manager figure out how the system arrives at an estimate. It is widely accepted that effective cost estimation demands more than one technique, and case-based reasoning has proven to be a good candidate technique [Schofield, 95].
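A minimal sketch of this retrieve-reuse-revise-retain cycle is given below. The case attributes, the Euclidean similarity measure, and the averaging of the retrieved efforts are illustrative choices, not the specific mechanisms of any of the systems reviewed here.

```python
import math

# Each case: normalized project features plus the known effort (person-months).
casebase = [
    {"features": [0.3, 0.5, 0.2], "effort": 24.0},
    {"features": [0.6, 0.4, 0.7], "effort": 41.0},
    {"features": [0.8, 0.9, 0.6], "effort": 63.0},
]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate(new_features, k=2):
    # Retrieve: find the k most similar cases in the casebase.
    retrieved = sorted(casebase, key=lambda c: distance(c["features"], new_features))[:k]
    # Reuse: combine their solutions (here, a simple average) as the proposed effort.
    return sum(c["effort"] for c in retrieved) / k

new_project = [0.5, 0.45, 0.5]
proposed = estimate(new_project)
print(f"Proposed effort: {proposed:.1f} person-months")

# Revise (if the proposal is not validated) and retain the confirmed solution.
confirmed = proposed * 1.1          # e.g. adjusted for the current environment
casebase.append({"features": new_project, "effort": confirmed})
```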

Conclusion of Related Literature

Recently, many studies have applied the case-based approach to estimate the cost of software projects. We outline their methodologies below and then draw a conclusion.

S. J. Delany and P. Cunningham [Delany 00] argued that if a complete representation is not available, an automatic reasoning mechanism will not be able to produce good cost estimates. The alternative they proposed is to focus on a measure called the productivity coefficient rather than the expected effort. The coefficient measures the potential risk revealed by the characteristics of a project compared with previous project experiences. The productivity measure follows the equations below:

Effort is directly proportional to (Size / Productivity), and

Productivity Coefficient = Actual Case Productivity / Average Productivity

where Size is measured by LOC, Productivity is the amount of time to develop one LOC, the actual productivity of a project is available after the project is completed, and the average productivity is the average development productivity of the organization calibrated across all the cases in the casebase.

Our previous study [Huang 00] proposed an integrated method that combines clustering, rule-based reasoning (RBR), and case-based reasoning (CBR) techniques to estimate the cost of software development. The approach estimates cost in three steps. The first step is to cluster the projects with the well-known K-means method. The second step is to use generic rules, built beforehand and obtained from experienced managers with domain knowledge, to shrink the search range of the CBR. Finally, the CBR is invoked to find the most similar case and predict the cost of the new project. The most important contribution of this study is the indexing idea based on a clustering mechanism. Such a mechanism helps managers obtain more detailed information about each cluster, rather than only general information about all cases in the casebase. This idea is also an important basis of this thesis.
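The sketch below illustrates the cluster-then-retrieve idea behind the indexing approach in [Huang 00]: projects are first grouped with K-means, and retrieval is then restricted to the cluster closest to the new project. It is only an outline under assumed feature vectors and assumes scikit-learn is available; the rule-based step that further shrinks the search range is omitted.

```python
from sklearn.cluster import KMeans  # assumes scikit-learn is available
import numpy as np

# Historical projects: illustrative normalized features and known efforts.
features = np.array([[0.2, 0.3], [0.25, 0.35], [0.7, 0.8], [0.75, 0.9], [0.5, 0.55]])
efforts  = np.array([20.0, 23.0, 60.0, 66.0, 40.0])

# Step 1: index the casebase by clustering the projects (K-means).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Step 2 (omitted here): generic rules would further shrink the search range.

# Step 3: restrict CBR retrieval to the cluster of the new project.
new_project = np.array([[0.22, 0.32]])
cluster = kmeans.predict(new_project)[0]
members = np.where(kmeans.labels_ == cluster)[0]

distances = np.linalg.norm(features[members] - new_project, axis=1)
most_similar = members[np.argmin(distances)]
print(f"Cluster {cluster}, most similar case {most_similar}, "
      f"predicted effort {efforts[most_similar]:.1f} person-months")
```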

ESERG has studied CBR for many years [Mair 99, Schofield 95, Schofield 98]. ESERG mainly proposed a solution, the ANaloGy software tool (ANGEL), to carry out cost estimation for a software project. ANGEL uses the optimal set of features of the cases in the casebase to search for the most similar cases as the base solution for the problem (or target). ANGEL's revision strategy is then to average the retrieved cases and report the average as the estimated result. The concept of an optimal set of features is a significant method that we will adopt in this thesis; a more detailed introduction is given in chapter four.

D. McSherry [McSherry 91] observed that CBR approaches rely on domain-specific rules to adapt a similar case retrieved from the casebase to match the target. However, the expertise necessary for case adaptation may be difficult to capture in the form of rules. Therefore, McSherry proposed an approach to case adaptation that does not rely on domain-specific rules. The aim of the approach is not to devise another novel CBR procedure but to provide an upper bound and a lower bound for the value of the target. An overall ranking of the target case in comparison with two similar cases provides a basis for estimating the value of the target case by interpolation between its upper and lower bounds. The search for the bounds is based on the following definition: one case dominates another if its value with respect to at least one case feature is preferred to the value of the other case, and its values with respect to each of the remaining case attributes are at least equally preferred to those of the other case. Two cases are incomparable if neither dominates the other. Thus, of two cases retrieved by this definition, the one dominating the target is called the upper bound and the other the lower bound. Nevertheless, such a definition implies that the estimator has to know the priorities of the relevant features explicitly. The approach assumes that estimators already know these priorities, but commonly they do not, especially when the estimation is carried out in a complex environment.
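The sketch below encodes the dominance test described above for the case where higher feature values are assumed to be preferred; the feature names and values are hypothetical, and in McSherry's approach the preference direction of each feature would have to be supplied by the estimator.

```python
def dominates(case_a, case_b, features):
    """case_a dominates case_b if it is preferred on at least one feature and
    at least equally preferred on all the others (here: larger value = preferred)."""
    at_least_equal = all(case_a[f] >= case_b[f] for f in features)
    preferred_somewhere = any(case_a[f] > case_b[f] for f in features)
    return at_least_equal and preferred_somewhere

features = ["team_experience", "tool_support"]          # hypothetical features
target   = {"team_experience": 3, "tool_support": 2}
case_x   = {"team_experience": 4, "tool_support": 2}    # dominates the target
case_y   = {"team_experience": 2, "tool_support": 1}    # dominated by the target

upper = case_x if dominates(case_x, target, features) else None
lower = case_y if dominates(target, case_y, features) else None
print(upper is not None, lower is not None)   # True True: bounds for interpolation
```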

The approach of Vicinanza et al. [Vicinanza 92] is to use domain-specific similarity measures (mainly based on size, e.g., LOC and function points) coupled with rule-based adaptation. The authors extracted the rules by analyzing a protocol of an expert estimating project effort for hypothetical projects. This has been implemented in a tool known as Estor. They found that their technique outperforms COCOMO and Function Points on Kemerer's (Kemerer, 1987) dataset with an additional seven projects. One disadvantage of this approach is that it is very specific to the particular dataset, since the rules and similarity measures are defined in terms of the features available; it is unclear how the approach could be easily generalized [Kadoda 00].

To understand the current state of applying case-based reasoning to cost estimation, we analyze the related literature along four dimensions: Decision Information, Rules Revision, Domain Understanding, and Cases Index. The result is shown in Table 2-3.

Table 2-3. Comparison of CBR-based cost estimation

Study               Decision Information   Rules Revision   Domain Understanding   Cases Index
Delany, 00          Medium                 No               ill-defined            No
Huang, 00           Low                    No               well-defined           Yes
Mair, 99            Low                    No               ill-defined            No
McSherry, 91        High                   Yes              ill-defined            No
Schofield, 95, 98   Low                    No               ill-defined            No
Vicinanza, 92       High                   Yes              well-defined           No

The judgment criteria for each dimension are described below.

1. Decision Information: we define this item as how much information a system can provide to the estimator or manager. Generally speaking, rules make more sense than pure numeric results. As we mentioned in chapter one, although the ANN outperforms the other machine learning techniques in accuracy, the results it provides are only numeric values, with no reasonable explanation of its inner estimating process. However, managers often need sound reasons before they accept a result. Thus, explanatory ability becomes an important indicator for measuring the quality of an estimation model. To avoid too much subjective influence on our judgment, we use the criteria exhibited in Figure 2-2:

a. IF the system uses rules to revise the prediction, THEN we rate it High.
b. IF the system uses a plausible mathematical expression to provide some information, THEN we rate it Medium.
c. IF the system uses neither rules nor a mathematical expression, THEN we rate it Low.

Figure 2-2. The criteria for Decision Information

We consider that if a system revises the prediction by rules, then it can provide more information to support the manager's decision at some level. Consequently, how completely the information and its usefulness can be exhibited depends on the construction of the rule-base, and the related work may involve knowledge representation. We also credit a system that can express some relationships among features through a mathematical expression. So we roughly rate each system High, Medium, or Low according to its revision mechanism or the degree to which it expresses relationships among features.

Reviewing the literature, we note that both McSherry and Vicinanza used rules to revise their predictions and therefore receive the value High on this item. However, McSherry differs from Vicinanza in that McSherry applied his model to an ill-defined domain, whereas Vicinanza applied his to a well-defined one. In addition, Delany used a plausible mathematical expression to express the relationships among productivity, effort, and project size; such information may indeed give the project manager useful decision information when considering the required effort or duration. The remaining studies do not add any extra mechanism to provide more information; they drive the prediction purely through the case-based reasoning process.

2. Rules Revision: we define this item as whether or not a system uses rules to revise the prediction. The purpose of this item is to understand what domain properties most researchers assume they are dealing with. The domain properties we consider are ill-defined, semi-defined, and well-defined. A well-defined domain means that the domain knowledge is easily presented as rule-like information; such a property can become the basis for generating or even revising the solution, and it helps the predictive model provide well-understood information. On the contrary, an ill-defined domain indicates that the domain knowledge cannot easily be exhibited clearly. Therefore, it is difficult for the predictive model to base its estimates on general domain knowledge; the model can only use the features of the target to search for the most probable solution. A semi-defined domain lies between them: although some of the knowledge is not crystal clear, it can still be presented according to some features, bearing in mind that the constructed rules might not generalize. These properties come down to one question: do we need additional knowledge or expertise to construct the predictive model? And we know that in most situations we have to deal with an unclear domain.

According to our investigation, McSherry and Vicinanza both have a rule revision mechanism, but their degree of domain understanding is completely different. In Vicinanza's study, generalized domain knowledge is invoked to construct the rule-base in an interpreted form; in other words, the degree of domain understanding is an important part of driving the rules that revise the prediction. However, the cost estimation domain is usually ill-defined. Contrary to Vicinanza, McSherry assumed that the domain was ill-defined, and he revised the prediction according to the degrees of importance of the features.

3. Domain Understanding: following the discussion above, we define this item to characterize, approximately, the kind of domain we will face. As the results in Table 2-3 show, most studies developed their models for an ill-defined domain. This reflects the fact that explicit principles do not exist in most cost estimation domains. The situation conforms to our assumption about the domain, i.e., we will not try to elicit generalized domain knowledge to construct a rule-base.

4. Cases Index: we define this item as whether or not a system indexes its cases. Apart from our previous study [Huang 00], the remaining studies paid no attention to indexing cases. This situation can be explained by two observations. First, indexing is not necessary given current computational power, and it does not by itself improve the accuracy of estimation, so researchers preferred to spend their effort on the other key steps of case-based reasoning, such as case revision. However, this also causes some problems: for example, if the manager wants to know which other cases are similar to the current problem, we cannot simply hand him a pile of cases and tell him which ones are most similar according to the system's similarity measurement. Secondly, although case indices cannot by themselves increase the accuracy of estimation, an index can provide some useful decision information.

To cite an instance, suppose the cases are classified into several categories according to different sets of features. When the system retrieves similar cases from one of these classes to estimate the target, we can advise the manager to pay attention to the important features of the class those cases belong to. Such a mechanism can be implemented by means of clustering technology. In our previous study, rules were used to index the cases and shrink the search range; although this did provide some information for the manager, it presupposes that we understand the domain. To accommodate the possibly ill-defined situation, we will propose a clustering approach for indexing the cases. The details are discussed in chapter three.

Readers may notice that we do not analyze the literature along an accuracy dimension. This is because the predictive accuracy of each model was measured on different casebases, so comparing them would be meaningless. Another reason is that many studies have already shown that case-based reasoning is an outstanding candidate for, or component of, existing techniques; it is unnecessary to report accuracy results to reach that conclusion again [David 98, Finnie 97, Shrinivasan 95].

Limitation of Applying CBR

Although the concept of estimating by case-based reasoning is relatively straightforward, some limitations of the approach need to be noticed. The case-based reasoning technique requires a considerable amount of historical data to provide sufficient information, and it is easily influenced by extreme data, so the estimation error is sometimes too large to be accepted.

CBR also has a scaling problem when preparing the data: for example, when we compare two quite different types of attributes, it is difficult to scale them to the same unit [Aamodt 94, Delany 98]. In addition, CBR has an awkward problem in dealing with outlier cases. Once such a case exists in the casebase, it is usually eliminated in order to maintain the average level of estimation, even though such cases may well occur in the real world.

Decision Tree Learning Model

Another artificial intelligence technique is the decision tree learning model. In the tree, the nodes represent attributes that are used to separate projects into disjoint subgroups, and the leaves represent the average cost of software development. By descending a path from the root, we can determine the cost of a new project. The attributes in the tree must be determined beforehand, and each disjoint subgroup is then recursively partitioned by those attributes until no further partitioning is possible. To optimize efficiency, attribute selection is important, and various attribute selection procedures have been proposed, such as ID3 [Lee 98].
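As a minimal sketch of the idea above, the following code builds a one-level tree by hand: it splits historical projects on a single predetermined attribute and stores the average cost in each leaf. A real model (for instance, an ID3-style procedure) would choose the splitting attributes automatically and recurse further; the attribute name, threshold, and data here are hypothetical.

```python
# Historical projects: a splitting attribute and the known development cost ($K).
projects = [
    {"kloc": 10, "cost": 150},
    {"kloc": 12, "cost": 180},
    {"kloc": 40, "cost": 610},
    {"kloc": 55, "cost": 790},
]

def build_one_level_tree(data, attribute, threshold):
    """Partition the projects on one attribute; each leaf holds the average cost."""
    left  = [p["cost"] for p in data if p[attribute] <= threshold]
    right = [p["cost"] for p in data if p[attribute] > threshold]
    return {
        "attribute": attribute,
        "threshold": threshold,
        "left_avg":  sum(left) / len(left),
        "right_avg": sum(right) / len(right),
    }

def estimate(tree, project):
    """Descend the (single) path to a leaf and return its average cost."""
    if project[tree["attribute"]] <= tree["threshold"]:
        return tree["left_avg"]
    return tree["right_avg"]

tree = build_one_level_tree(projects, attribute="kloc", threshold=20)
print(estimate(tree, {"kloc": 15}))   # -> 165.0 ($K), average of the small-project leaf
```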