LESSON NO. 5. Sales-Based Modeling


Assigned Reading

1. UBC Real Estate Division. BUSI 444 Course Workbook. Vancouver, BC: UBC Real Estate Division.
2. UBC Real Estate Division. Advanced Computer Assisted Mass Appraisal. Vancouver, BC: UBC Real Estate Division.
   Chapter 8: Specifying Sales Comparison Models
   Chapter 9: Important Tools for Model Calibration
3. City of Calgary Assessment Department. "Regression Modeling in Calgary: A Practical Approach". Assessment Journal. Vol. 5, Num. 4, August. International Association of Assessing Officers.

Recommended Reading

1. UBC Real Estate Division. BUSI 344 Course Workbook. Vancouver, BC: UBC Real Estate Division. Lesson No. 8: Comprehensive Model Building: Data Screening and Testing.
2. UBC Real Estate Division. Advanced Computer Assisted Mass Appraisal. Vancouver, BC: UBC Real Estate Division.
   Chapter 11: Sales Analysis and Mass Appraisal Performance Evaluation
   Chapter 12: Statistical Procedures and Performance Evaluation
3. Gloudemans, R. J. "Comparison of Three Residential Regression Models: Additive, Multiplicative and Nonlinear". Assessment Journal. Vol. 9, Num. 4, July/August. A good discussion of regression analysis principles and various methods of including location variation in a model.
4. Todora, J. and Whiterell, D. "Automating the Sales Comparison Approach". Assessment Journal. Vol. 9, Num. 1, January/February.
5. Wayne Moore. "Performance Comparison of Automated Valuation Models". Journal of Property Tax Assessment and Administration. Volume 3, Issue 1.

Learning Objectives

After completing this lesson, the student should be able to:

1. complete in-depth data screening and exploratory data analysis, building on the preliminary screening techniques demonstrated in Lesson 2, including transforming variables as necessary;
2. apply the multi-step process for regression modeling to specify, calibrate, and test a sales-based valuation model;
3. test the performance of the valuation model using a variety of parametric and non-parametric approaches, and make any necessary refinements;

4. discuss the importance of testing the influence of individual variables not included in the final specified model; and
5. test the model using a hold-out sample (sales data not used to develop the model) to determine its suitability for general valuation.

Instructor's Comments

Let's take stock of our progress through BUSI 444 so far. In Lessons 1 and 2, we reviewed data screening and data revelation techniques using property inventory and sales data. In Lesson 3, we investigated several methods to estimate land values, and in Lesson 4 we estimated market values under the cost approach. By now, you should feel comfortable using PASW/SPSS for data screening and analysis functions. We are now ready to progress to the challenge of developing a sales-based additive regression model.

In this lesson, we will revisit the Midsize sales database from Lesson 2 to illustrate the steps in building an additive model. The model will be based on direct sales comparison, using a database of market sales. However, the Midsize database was originally developed for use in a cost-based model, so it has a much higher number of inventory-related variables than a typical sales-based model. As a result, we will label the model type as a "cost-specified market approach".

This model type was discussed by Richard Borst, a well-respected CAMA expert, at a GIS/CAMA conference. Borst describes the model as a "Transportable Cost Specified Market Approach". It is "transportable" because it can be used almost universally and is applicable in a wide variety of contexts. It is "cost-specified" because it uses primarily the same land and improvement variables that would be used in a cost model. However, it is a "market approach" in that it is calibrated using multiple regression and market sales data. In his 30 years of modeling experience, Borst has concluded this to be an effective model for use in many situations. It is easily understood by appraisers and the general public as it contains most of the property features that are considered to affect value.

In this lesson we will use the final version of this database from Lesson 2, Midsize700.sav. The advantage of using this data is that it has already been refined through preliminary data screening, data exploration, and transformation in Lesson 2. We have confirmed the data does not contain any fundamental flaws or issues and is suitable for development of an additive linear regression model.

Database for this lesson: we will be continuing with the "Midsize700" database from Lesson 2. If you wish to work with a fresh database, you can download the "Midsize700" database from the course webpage. However, this is not required: since this is the same database we saved at the end of Lesson 2, you can simply continue with the version you saved.

Steps in the Model Building Process

Building an additive regression model requires a multi-step process. In each step, the analyst must apply consistent methods and be prepared to support all assumptions and decisions. As we will see later in this lesson, there are significant model performance risks associated with certain assumptions and decisions. The key for the analyst is to find the right tools to identify, quantify, and minimize these issues.

Model building can be described as three general activities: model specification, calibration, and testing. We will cover each of these activities in some depth, following the 9-step process below.

1. Describe an appropriate general model to use and state this model using standard mathematical symbols.
2. Review the variables in the database and identify those which are suitable for this type of model.
3. Examine the potential independent variables for relationships with each other and with the dependent variable using graphic analysis, cross tabs, and correlation analysis.
4. (a) Create any transformations necessary to make variables suitable for the chosen model structure.
   (b) Create any additional transformations required to remove problems of multicollinearity identified in Step 3.
5. Repeat Step 3 with new variables.
6. List a final group of potential variables for calibration.
7. Calibrate the model using an appropriate method.
8. Test and evaluate the model for use.
9. State your conclusions as to model quality.

Steps 1 to 6 are model specification: selecting the variables to be considered and defining their relationships to value and to each other. Step 7 is calibration, where you determine the regression equation to value the properties. Steps 8 and 9 are testing and reporting values.

One caution before we begin: model building is not simply a mathematical exercise. The modeler must apply appraisal judgement throughout the process of specifying the initial model, calibration, and testing. This adds a degree of subjectivity to the process, which means you cannot specify ironclad decision rules that are universally applicable. Just as in single-property appraisal, there is an element of art to temper the apparent hard science in statistical modeling.

Step 1: Choose a Valuation Model

The first step in the regression model development process is to specify the model in a general mathematical form. The general model form we will use is as follows:

$$MV = \pi GQ \times \left[ (\pi BQ \times \textstyle\sum BA) + (\pi LQ \times \textstyle\sum LA) + \textstyle\sum OA \right]$$

where
$\pi GQ$ is the product of the general qualitative components;
$\pi BQ$ is the product of the building qualitative components;
$\pi LQ$ is the product of the land qualitative components;
$\sum BA$ is the sum of the building additive components;
$\sum LA$ is the sum of the land additive components; and
$\sum OA$ is the sum of the other additions additive components.

Since we have additive components for land and improvements, we have essentially specified a cost approach regression model. This must be restated in a format consistent with an additive model as follows:

[Equation not legible in this copy.] Its terms are, in order:

- the sum of the general qualitative factors in binary form;
- the sum of the building qualitative factors in binary form;
- the building qualitative factors in multiplicative form;
- the building additive factors;
- the land qualitative factors in binary form;
- the land qualitative factors in multiplicative form;
- the land additive factors;
- the other building qualitative factors in multiplicative form; and
- the other building additive factors.

Step 2: Variable Review

In order to use this form of model, we need to review the available independent variables and sort them into appropriate categories. We will catalogue the pool of variables in the Midsize700 database into the following six major variable types.

General Qualitative

General qualitative variables are intended to capture general locational influences, including neighbourhood. Neighbourhood variables reflect a range of influences associated with a broad geographic area. Location variables reflect more specific positive and negative influences. Typical positive location factors might include parks, greenbelts, and schools. Negative locational influences for single family residential uses include high traffic, noise, or proximity to commercial or industrial uses.

Nbhd is the only general qualitative variable in our database, accounting for locational influences.

Building Quality

Building quality variables include construction quality, depreciation or age factor, and construction type such as ranch style, split level, and two storey. Where construction quality has been scaled, it becomes a multiplicative variable. Similarly, overall depreciation or simply physical depreciation will be multiplicative if calculated as the percentage of remaining life.

The Midsize700 variables which fall into this category are as follows:

ManualClass: This is a nominal variable. In Lesson 2 it was transformed to a scaled variable. It can also be expressed as a binary variable.

Lin_mancls: This is the scaled version of ManualClass. This variable is multiplicative and must be combined with related additive variables such as floor areas to be used in an additive model.

EffectiveYear: Effective age is the age of the property based on the observed amount of depreciation from all sources, including physical deterioration and obsolescence. Effective age may not be equivalent to chronological age. For example, renovation and restoration will reduce the effective age of a property. This numerical variable can be used directly in the model, converted to age, or used to develop a multiplicative depreciation variable.

Building Additive

Building additive variables include finished floor areas, bathrooms, bedrooms, fireplaces, or any other building feature that is developed by measuring or counting. The following are the additive variables in the original database (in Lesson 2 a number of new variables were created based on these variables):

Foundation
FinishedArea
Stories
FullBath
ThreeQtrBath
HalfBath
Bedrooms
MultiCarGarage
SingleCarGarage
Carport
Pool
OutBuildings
Fireplcs
BasementTotalArea
BasementFinishedArea
DeckAreaCovered
DeckAreaUncovered

Land Qualitative

Land qualitative variables represent factors such as view, waterfront, topographic features, and level of services. These variables are often binary, indicating the presence or absence of the feature. Other qualitative modifiers for residential land are multiplicative, such as size adjustment factors. Residential land is generally affected by the economic principle of diminishing returns. In other words, as the size of a residential lot increases beyond a certain point, the unit value (price per square foot or per front foot) tends to decrease.

Binary land quality variables in the database include:

Cornerlot
PrimeView
GoodView
FairView

Land Additive

The only land additive variable is LotSizeSqft.

Other Building Qualitative and Additive

Other qualitative and additive building factors are represented by the same types of feature as for the main building. These include:

MultiCarGarage
SingleCarGarage
Carport
Pool
OutBuildings

Steps 3, 4, and 5: Examine Potential Independent Variables and Transform as Necessary

We already completed much of the exploratory data analysis and data transformation required for an additive regression model in Lesson 2. In this section, we investigate the need for additional transformations for land and building qualitative factors.

The only general quality variable in the midsize700.sav database is Nbhd. General practice in an additive model is to avoid including this factor, since Nbhd captures many of the features that would normally be included individually in the model, such as age, quality, and land characteristics. In other words, there is a risk of double counting if variables with overlapping influences are included in a model. To address this issue, common modeling practice is to first calibrate the model with no location variable, and then make subsequent location adjustments using the Nbhd variable factors or response surface analysis [1]. Near the end of this lesson, we will demonstrate how variables excluded from the model, such as Nbhd, can be examined for potential influences and the model adjusted for these.

Building quality can be represented in several different ways, as follows:

- Lin_mancls, as a quality factor (multiplicative);
- EffectiveYear, as a factor representing depreciation; or
- a new variable, EffectiveAge, created by transformation of EffectiveYear.

A more complete analysis is needed to determine which approach will result in the best variable. However, EffectiveAge is usually preferable since this factor accounts for the different size and quality of buildings rather than a constant dollar amount based only on age. Another approach is to develop a physical depreciation variable. However, in this case we do not have sufficient data to create depreciation relationships for each manual class.

We now turn our attention to the land variables. As noted earlier, the unit value of residential land typically follows a non-linear relationship with increasing size. In other words, each square foot of a very large parcel of land is often worth less than each square foot of a smaller parcel. This is because the usefulness of additional units of land varies depending on how much land is already included in a parcel. A very large parcel would benefit less from an additional unit of land than would a very small parcel. To account for this relationship, a size adjustment is required.

In Lesson 3, we investigated several methods for accounting for this non-linear size relationship. All of the methods depended on the creation of a land residual, because a separate land value is a necessary part of a cost model. For our sales-based regression model, a separate land value is not needed, so the following formula will be used (this method is also used later in the nonlinear regression lesson):

$$\text{sizefact} = \left( \frac{\text{mean lot size}}{\text{LotSizeSqft}} \right)^{E \times 0.5}$$

[1] Response Surface Methodology (RSM) is a collection of statistical and mathematical techniques useful for developing, improving, and optimizing processes. RSM is particularly useful where several variables potentially impact one performance measure or quality characteristic, known as the response. For example, Lin_mancls and EffectiveYear both influence building value.

We will follow the three steps below to create a size adjustment:

Step 1: Create a size factor by dividing the mean property size of 7,388.2 square feet by each LotSizeSqft value.
Step 2: Take the square root of the outcome from Step 1. Remember that raising a variable to the power of 0.5 is the same as taking the square root.
Step 3: Raise the result from Step 2 to the exponent E = 1.2. This value was found by trial and error, trying different exponents in a plot of sizefact against LotSizeSqft until we found the expected curve relationship between value and lot size. This process will not be shown here.

These steps create a scaled adjustment factor that can be applied to properties with greater or lesser square foot areas than the average lot. We will create a new variable adjlotsize with the following syntax commands:

COMPUTE sizefact = (7388.2/LotSizeSqft)**(1.2*0.5).
COMPUTE adjlotsize = sizefact*LotSizeSqft.

[Figures: sizefact plotted against LotSizeSqft; adjlotsize plotted against LotSizeSqft.]

These results are shown in the graphs. The graph of sizefact against LotSizeSqft shows an excellent curve shape, but the curve of adjlotsize against LotSizeSqft does not level off at the higher values as is usually expected. This may result in over-valuation of the larger lots.

We created a variable for finished upper floor area in Lesson 2. In a cost-specified model, it is generally assumed that not all finished area above the basement will have the same unit value; for example, first floor finished area normally has a higher square foot value than second floor finished area. Therefore, if the first floor area is not valued separately in the model, the first floor area will be under-valued while the second floor area will be over-valued. Our hypothesis is that it is better to attempt to differentiate between the value of the first floor area and the second floor area in the model than to model only the total finished floor area for all floors.

We will apply the following transformations to create an area variable for each floor:

COMPUTE flrarea1 = upperfinish / Stories.
COMPUTE flrarea2 = upperfinish - flrarea1.

These transformations assume that 1.5 storey homes have a second floor area exactly 50% of the first floor and that 2 storey homes have a second floor area equal to the first. We can test the final value after modeling is completed using the storey binary variables. Floor area can now be represented in the model by either the total finished floor area or the two separate floor areas.

Bathrooms can be represented by the three separate variables in the original data or the new totalbath variable created in Lesson 2.

In order to include the building quality factor Lin_mancls in the model, we must transform this variable by multiplying it with the floor area variables:

COMPUTE lin1area = Lin_mancls * flrarea1.
COMPUTE lin2area = Lin_mancls * flrarea2.

Some of the nominal variables, such as Foundation, Pool, and Stories, were transformed in Lesson 2 to a format that can be used in the model. However, OutBuildings was not transformed. A simple recoding transforms the OutBuildings values from a "Y" or "N" string to numeric binaries. The syntax for this transformation is provided below:

RECODE OutBuildings ("Y"=1) (else=0) into Outbldgbin.

The next transformation creates the effective age variable, which will serve as a proxy for physical and functional depreciation. Since the valuation base is 2006 and no properties were built later than 2005, we can create an effective age variable, effage, as follows:

COMPUTE effage = 2006 - EffectiveYear.

Later in the lesson, we will illustrate how an additive depreciation variable can be created using effective age and economic life relationships.
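For readers who want to check the arithmetic outside PASW/SPSS, the transformations above can be mirrored in a few lines of Python. This is only a sketch: the function names, the dictionary layout, and the sample sale (including its Lin_mancls value of 1.1) are assumptions made for illustration and are not taken from the Midsize700 database.

```python
def size_factor(lot_sqft, mean_lot_sqft=7388.2, exponent=1.2):
    """Parallel of: COMPUTE sizefact = (7388.2/LotSizeSqft)**(1.2*0.5)."""
    return (mean_lot_sqft / lot_sqft) ** (exponent * 0.5)


def transform(record, base_year=2006):
    """Return the derived variables for one sale (record is a plain dict)."""
    out = {}
    out["sizefact"] = size_factor(record["LotSizeSqft"])
    out["adjlotsize"] = out["sizefact"] * record["LotSizeSqft"]
    # Split finished area above the basement into first and second floor.
    out["flrarea1"] = record["upperfinish"] / record["Stories"]
    out["flrarea2"] = record["upperfinish"] - out["flrarea1"]
    # Combine the multiplicative quality factor with the additive areas.
    out["lin1area"] = record["Lin_mancls"] * out["flrarea1"]
    out["lin2area"] = record["Lin_mancls"] * out["flrarea2"]
    # Parallel of: RECODE OutBuildings ("Y"=1) (else=0) into Outbldgbin.
    out["Outbldgbin"] = 1 if record["OutBuildings"] == "Y" else 0
    out["effage"] = base_year - record["EffectiveYear"]
    return out


# Hypothetical sale: an average-sized lot with a 1.5 storey house.
sale = {
    "LotSizeSqft": 7388.2,
    "upperfinish": 1800.0,
    "Stories": 1.5,
    "Lin_mancls": 1.1,      # illustrative quality factor
    "OutBuildings": "Y",
    "EffectiveYear": 1990,
}
derived = transform(sale)
```

For a 1.5 storey house the split gives a second floor that is half the first floor (600 against 1,200 square feet here), and a lot of exactly average size gets a size factor of 1, so its adjusted lot size equals its actual size.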
Step 6: List Variables for Calibration

Based on the results of the first five steps in the model building process, the following variables are considered for calibration:

Bedrooms, MultiCarGarage, SingleCarGarage, Carport, OutBldgbin, Cornerlot, adjlotsize, Fireplcs, DeckAreaCovered, DeckAreaUncovered, poolyes, effage, totalbath, story15, story20, linfinarea, lin1area, lin2area, linbsmtfin, FullBath, ThreeQtrBath, HalfBath, crawl, partbsmt, slab

The three view variables were tested using Crosstabs and only FairView had any observations coded "Y"; there were only two sales with a fair view. This is too few for the variable to be of any use. The minimum required is five sales with the feature or five without it.

We will use multiple regression analysis to test these variables and find the combination that best explains the variation in sale prices. In order for the regression process to work effectively, obvious multicollinearity must be avoided. For example, the separate floor area variables cannot be used with total finished area, nor can the separate bath variables be used with total baths. In addition, when a group of binary variables represents all potential values of the original variable, as with the foundation types, one of the group must be omitted to act as a reference or control variable. Full basement will be omitted for this reason.

We will combine the variables into four groups, using different floor area and bathroom combinations, and determine which combination is optimal for further analysis and model calibration. Each group will contain the following 17 common variables: adjlotsize, slab, poolyes, partbsmt, SingleCarGarage, Carport, DeckAreaCovered, Fireplcs, crawl, DeckAreaUncovered, Bedrooms, MultiCarGarage, effage, Outbldgbin, story15, story20, and corner. The groups will be "customized" with the following variables:

Group 1
  Bath variables: HalfBath, ThreeQtrBath, FullBath
  Floor area variables: linfinarea
Group 2
  Bath variables: totalbath
  Floor area variables: linfinarea
Group 3
  Bath variables: totalbath
  Floor area variables: lin1area, lin2area, linbsmtfin
Group 4
  Bath variables: HalfBath, ThreeQtrBath, FullBath
  Floor area variables: lin1area, lin2area, linbsmtfin

These variable groups meet the criteria specified in the model specification process and follow sound appraisal judgement.
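The requirement noted above, that one binary from a complete set (such as the foundation types) must be left out as a reference, can be seen numerically. The sketch below uses NumPy and a made-up set of six sales. The level label "full" and the data are assumptions for illustration only. With a constant and all four binaries, the columns are linearly dependent, so the regression cannot separate their effects. Dropping the full basement binary removes the dependency.

```python
import numpy as np

# Hypothetical foundation type for six sales (labels are illustrative).
foundation = ["full", "crawl", "slab", "partbsmt", "full", "crawl"]
levels = ["full", "crawl", "slab", "partbsmt"]


def design_matrix(included_levels):
    """Constant column plus one 0/1 column per included foundation level."""
    rows = [
        [1.0] + [1.0 if f == level else 0.0 for level in included_levels]
        for f in foundation
    ]
    return np.array(rows)


X_all = design_matrix(levels)      # constant + all four binaries
X_ref = design_matrix(levels[1:])  # "full" omitted as the reference

rank_all = np.linalg.matrix_rank(X_all)  # fewer than the 5 columns
rank_ref = np.linalg.matrix_rank(X_ref)  # equals the 4 columns
```

In the full-rank version, each remaining foundation coefficient is read as a difference from the omitted full basement category.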
We will now test each group to select the best complete set of variables for model calibration. Keep in mind that we may need to retain one or more variables to aid the model's explainability from an appraisal perspective even though regression diagnostics indicate they provide little or no statistical benefit.

We will evaluate the performance for each group with the following statistics: R², adjusted R², SEE, and F. The best model will be the one with the highest adjusted R² and lowest SEE. However, since multicollinearity is a concern, we also need to pay attention to the VIF statistic.

All four groups were tested using the "Enter" method for multiple regression. To follow along in PASW/SPSS, complete the following commands:

- Select Analyze → Regression → Linear...
- Select Adj_Price as the dependent variable.
- Select all common variables and the unique variables for Groups 1-4 as the independent variables.
- Method should be "Enter".
- Click Statistics... and select Estimates, Model fit, Descriptives, Collinearity diagnostics.
- Continue → OK to run the regression.

Only Group 1 results are shown below, with the other groups summarized in the following table.

Group 1

[Table: Model Summary. Columns: Model, R, R Square, Adjusted R Square, Std. Error of the Estimate. Values not legible in this copy.]

a. Predictors: (Constant), DeckAreaUncovered, poolyes, partbsmt, corner, SingleCarGarage, HalfBath, slab, story15, ourbldgbin, adjlotsize, Carport, Bedrooms, ThreeQtrBath, Fireplcs, crawl, DeckAreaCovered, story20, MultiCarGarage, linfinarea, effage, FullBath

[Table: ANOVA. Columns: Sum of Squares, df, Mean Square, F, Sig. Sum of Squares: Regression 4.143E…, Residual 1.059E…, Total 5.202E…; exponents and remaining values not legible in this copy.]

a. Predictors: as listed above
b. Dependent Variable: Adj_Price

[Table: Coefficients (dependent variable: Adj_Price). Columns: Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), t, Sig., Collinearity Statistics (Tolerance, VIF). Rows: (Constant), adjlotsize, slab, poolyes, partbsmt, SingleCarGarage, Carport, DeckAreaCovered, Fireplcs, crawl, DeckAreaUncovered, Bedrooms, MultiCarGarage, effage, Outbldgbin, story15, story20, corner, HalfBath, ThreeQtrBath, FullBath, linfinarea. Values not legible in this copy.]

The following table summarizes the general regression statistics for all groups.

[Table: Model Summary and ANOVA by group. Columns: Group No., Adj R², SEE, F, Sig. Values not legible in this copy.]

In comparing models, the most important statistics are the adjusted R² and SEE. At this stage of the analysis, the best model is the one with the highest adjusted R² and lowest SEE. All four models are very similar for these statistics.

The F statistic measures performance of the overall model when compared to the result that would be obtained by estimating the sale price by simply using the mean sale price. With the large number of sales used here, and the relatively small number of variables in the model, the F value is about what would be expected. The Sig. value should be less than .05. Here, the F statistics for all groups are well above 4.0, giving us confidence that the model is significant in predicting sale price at the 95% confidence level.

Digging a little deeper, the VIF statistics indicate some issues with multicollinearity. Most of the VIF values are below the threshold, but some variables have very high VIFs, indicating extreme multicollinearity. A review of the statistical output for extreme VIF values reveals story20 and lin2area in Groups 3 and 4. In addition, there are problems with the bath, main floor area, and effage variables in all groups.

Multicollinearity

Many statistical software packages will produce statistics which help identify multicollinearity at the time of running a regression process. These statistics are the Tolerance and the Variance Inflation Factor (VIF). As VIF = 1 / Tolerance, only one needs to be examined.

A Tolerance (VIF) statistic is calculated for each independent variable included in the model. For a given independent variable, the Tolerance is (1 − R²), where R² is the coefficient of determination obtained by regressing the given variable on the rest of the independent variables. If R² is zero (that is, no correlation is present between the given independent variable and the remaining independent variables), then the Tolerance is 1, its maximum value. As R² ranges between 0 and 1, the minimum Tolerance is zero.

A Tolerance value of 0.3 or less (VIF greater than 3.333) can indicate that multicollinearity exists in the model. In that case, the independent variables in the model should be examined and new models should be tried, removing one or more variables to eliminate the problem.

Our next steps will involve removing variables to see if the multicollinearity can be reduced or eliminated. Appraisal judgement tells us that effective age and floor area are both very important valuation variables, so we will try removing the bath variables first. This leaves two variable groups to examine:

Group 5: all the common variables, plus lin1area, lin2area, and linbsmtfin
Group 6: all the common variables, plus linfinarea
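The Tolerance and VIF definitions above can be reproduced directly. The sketch below uses NumPy on synthetic data. The variable names and the data are assumptions, not values from the Midsize700 database. It regresses each predictor on the others, takes Tolerance as 1 − R², and reports VIF as its reciprocal. PASW/SPSS reports the same quantities when Collinearity diagnostics is selected.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return (tolerance, vif) arrays with one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on a constant plus the remaining predictors.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        tol[j] = 1.0 - r2
    return tol, 1.0 / tol


# Synthetic example: "baths" closely tracks "area"; "age" is unrelated.
rng = np.random.default_rng(0)
area = rng.normal(size=200)
baths = area + 0.1 * rng.normal(size=200)
age = rng.normal(size=200)

tol, vif = tolerance_and_vif(np.column_stack([area, baths, age]))
flagged = vif > 3.333  # the lesson's threshold (Tolerance of 0.3 or less)
```

Here the two closely related predictors are flagged while the unrelated one is not, which is the pattern the lesson describes for overlapping floor area and bath variables.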

Group 5

[Table: Model Summary. Columns: Model, R, R Square, Adjusted R Square, Std. Error of the Estimate. Values not legible in this copy.]

a. Predictors: (Constant), lin2area, poolyes, DeckAreaUncovered, corner, story15, slab, partbsmt, adjlotsize, ourbldgbin, SingleCarGarage, Carport, linbsmtfin, Fireplcs, DeckAreaCovered, crawl, lin1area, Bedrooms, MultiCarGarage, effage, story20

[Table: ANOVA. Columns: Sum of Squares, df, Mean Square, F, Sig. Sum of Squares: Regression 4.159E…, Residual 1.044E…, Total 5.202E…; exponents and remaining values not legible in this copy.]

a. Predictors: as listed above
b. Dependent Variable: Adj_Price

[Table: Coefficients (dependent variable: Adj_Price). Columns: B, Std. Error, Beta, t, Sig., Tolerance, VIF. Rows: (Constant), Bedrooms, Fireplcs, crawl, partbsmt, slab, story15, story20, poolyes, adjlotsize, effage, ourbldgbin, MultiCarGarage, SingleCarGarage, Carport, corner, DeckAreaCovered, DeckAreaUncovered, linbsmtfin, lin1area, lin2area. Values not legible in this copy.]

Group 6

[Table: Model Summary. Columns: Model, R, R Square, Adjusted R Square, Std. Error of the Estimate. Values not legible in this copy.]

a. Predictors: (Constant), linfinarea, slab, poolyes, adjlotsize, corner, partbsmt, story15, ourbldgbin, SingleCarGarage, Carport, DeckAreaCovered, Fireplcs, crawl, DeckAreaUncovered, Bedrooms, story20, MultiCarGarage, effage

[Table: ANOVA. Columns: Sum of Squares, df, Mean Square, F, Sig. Sum of Squares: Regression 4.130E…, Residual 1.073E…, Total 5.202E…; exponents and remaining values not legible in this copy.]

a. Predictors: as listed above
b. Dependent Variable: Adj_Price

[Table: Coefficients (dependent variable: Adj_Price). Columns: B, Std. Error, Beta, t, Sig., Tolerance, VIF. Rows: (Constant), Bedrooms, Fireplcs, crawl, partbsmt, slab, story15, story20, poolyes, adjlotsize, effage, ourbldgbin, MultiCarGarage, SingleCarGarage, Carport, corner, DeckAreaCovered, DeckAreaUncovered, linfinarea. Values not legible in this copy.]

The results are summarized in the table below:

[Table: Model Summary and ANOVA for Groups 5 and 6. Columns: Group No., Adj R², SEE, F, Sig. Values not legible in this copy.]
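Since the summary statistics used to compare groups are derived from the ANOVA sums of squares, it may help to see how they relate. The following sketch uses NumPy and synthetic sales. The prices, areas, and ages are invented for illustration. It computes R², adjusted R², SEE, and F for a least-squares fit using the standard textbook formulas, with n sales and k predictors.

```python
import numpy as np


def fit_statistics(X, y):
    """R2, adjusted R2, SEE and F for an OLS fit of y on X plus a constant."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)                 # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total sum of squares
    ss_reg = ss_tot - ss_res                      # regression sum of squares
    df_res = n - k - 1
    r2 = ss_reg / ss_tot
    return {
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / df_res,
        "see": (ss_res / df_res) ** 0.5,
        "f": (ss_reg / k) / (ss_res / df_res),
    }


# Synthetic sales: price driven by floor area and effective age plus noise.
rng = np.random.default_rng(1)
n = 100
area = rng.uniform(800, 3000, n)
age = rng.uniform(0, 50, n)
price = 50000 + 100 * area - 500 * age + rng.normal(0, 10000, n)

stats = fit_statistics(np.column_stack([area, age]), price)
```

Adjusted R² is never larger than R² because it penalizes each added predictor. This is why it, together with SEE, is the better basis for comparing groups that contain different numbers of variables.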

The R² and SEE statistics for Group 5 are very similar to those for Group 6, meaning the models for both groups appear to be reasonable predictors of sale price. Moving to the individual variable statistics, there is still considerable multicollinearity. Group 5 has extreme results with story20, effage, lin1area, and lin2area. Group 6 has high multicollinearity with effage and linfinarea, but not as extreme as in Group 5. We will continue our testing with Group 6.

In Group 6 we see that linfinarea and effage have VIFs above the critical value, but only marginally. Both variables have Sig. values of .000, indicating a high likelihood they are important to the model. Because their VIFs are not extreme and our appraisal judgement indicates they are important, we will leave these variables in for now, and revisit their possible multicollinearity later in the testing process.

The next issue we need to tackle is a series of unexpected (and in some cases negative) coefficients for garages and decks. For example, our appraisal sense tells us that a single car garage should add value rather than detract. We need to draw upon appraisal knowledge to solve the dilemma of the illogical garage and carport coefficients.

Let's assume that past experience indicates that multi-car garages are worth approximately 1.75 times a single car garage and carports are worth approximately 0.3 times a single car garage [2]. Similarly, experience shows uncovered decks are worth approximately 0.75 times as much as covered decks. Transformations will be run to reflect these relationships and see if the model result improves.

COMPUTE GARAGES = (1.75*MultiCarGarage+SingleCarGarage+.3*CARPORT).
COMPUTE DECKS = (DeckAreaCovered+.75*DeckAreaUncovered).

We replace the MultiCarGarage, SingleCarGarage, CARPORT, DeckAreaCovered, and DeckAreaUncovered variables with GARAGES and DECKS. This will be Group 7, with the following regression results.

[Table: Model Summary. Columns: Model, R, R Square, Adjusted R Square, Std. Error of the Estimate. Values not legible in this copy.]

a. Predictors: (Constant), decks, poolyes, corner, partbsmt, adjlotsize, slab, garages, story15, ourbldgbin, Bedrooms, Fireplcs, crawl, story20, effage, linfinarea

[Table: ANOVA. Columns: Sum of Squares, df, Mean Square, F, Sig. Sum of Squares: Regression 4.115E…, Residual 1.087E…, Total 5.202E…; exponents and remaining values not legible in this copy.]

a. Predictors: as listed above
b. Dependent Variable: Adj_Price

[2] Model calibration assumptions should be based on consistent valuation research. The explainability and credibility of a model will be weakened if the analyst cannot provide empirical evidence for key assumptions.
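The composite garage and deck variables defined above amount to weighted sums. As a quick check of the weights, here is a small Python parallel of the two COMPUTE statements. The function names and the sample inputs are assumptions for illustration. The weights 1.75, 0.3, and 0.75 are the assumed relationships stated in the lesson.

```python
def garages(multi_car, single_car, carport):
    """Parallel of: COMPUTE GARAGES = 1.75*MultiCarGarage + SingleCarGarage + .3*CARPORT."""
    return 1.75 * multi_car + single_car + 0.3 * carport


def decks(covered_area, uncovered_area):
    """Parallel of: COMPUTE DECKS = DeckAreaCovered + .75*DeckAreaUncovered."""
    return covered_area + 0.75 * uncovered_area


# Hypothetical properties.
multi_only = garages(multi_car=1, single_car=0, carport=0)     # 1.75 single-garage equivalents
single_plus_carport = garages(multi_car=0, single_car=1, carport=1)
deck_equiv = decks(covered_area=100.0, uncovered_area=200.0)   # covered-deck equivalent area
```

Expressing everything in single-garage (or covered-deck) equivalents means the regression estimates one coefficient per composite, which cannot produce a negative value for one garage type and a positive value for another.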

15 Sales-Based Modeling oefficients a Unstandardized oefficients Standardized oefficients ollinearity Statistics Model B Std. Error Beta t Sig. Tolerance VIF 1 (onstant) Bedrooms Fireplcs crawl partbsmt slab story story poolyes adjlotsize effage ourbldgbin corner linfinarea garages decks a. Dependent Variable: Adj_Price In this model we find the following: slight decline in R 2 and increase in SEE statistics; Bedrooms have an unexpected negative coefficient and low magnitude; except for linfinarea the relative magnitude of the other coefficients meets our expectations; Decks have a t-statistic of.243 and a Sig. value of 0.808, indicating a high probability that decks are not significant in the model; Outbldgbin also has a very low t-statistic of.073 and a Sig. Value of.942 indicating a high probability that outbuildings are not significant in the model; the highest Beta value is for linfinarea, which is a combination of quality and total floor areas. This result is expected since our appraisal sense tells floor area should contribute most to value and hence be most significant; and the lowest Beta values are those for DEKS and Outbldgbin. These also have high Sig. values, meaning they contribute little to the predictive ability of the model. orner and partbsmt also have marginal Sig. results, at.255 and.393, respectively. However, we will leave these variables in the model for now, for testing in stepwise regression. Before we proceed to Step 7, Model alibration, we must first separate our data into model and test databases. The test database will be used later in Step 8, Model Testing. The general practice when constructing a model with a large number of sales is to hold out a small proportion of the sales so that there are some sales to act as an unbiased test of the model quality. The goal is to test performance of a model by comparing the estimates produced by it to actual sales observations that were not used in creating the model. 
The test database is often called a "holdout sample". The model should only be tested in this manner if there are sufficient sales remaining in the model database to calibrate the model. As few as 30 sales may be sufficient to calibrate a simple model. However, typical practice is to ensure at least 5 sales for each variable in the model for statistically reliable outcomes.

The following heuristics or "rules of thumb" are generally applied in model calibration:
- at least 30 sales will generally be required to calibrate a simple model; and
- there should be at least 5 sales for each variable in the model.
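The two rules of thumb and the holdout idea can be sketched in a few lines of Python. This is only an illustration, not part of the PASW/SPSS workflow: the function names are hypothetical, and the figures (700 sales, 200 held out, 54 variables) are chosen to mirror the midsize700.sav exercise.

```python
import random


def min_sales_required(n_variables, per_variable=5, floor=30):
    """Rule-of-thumb minimum: 5 sales per variable, never fewer than 30."""
    return max(floor, per_variable * n_variables)


def split_holdout(sale_ids, n_test, seed=0):
    """Put the sales in random order, then hold out the last n_test of them."""
    shuffled = list(sale_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_model = len(shuffled) - n_test
    return shuffled[:n_model], shuffled[n_model:]


# Assumed figures: 700 sales, 200 held out, 54 candidate variables.
model_ids, test_ids = split_holdout(range(1, 701), n_test=200)
needed = min_sales_required(n_variables=54)

print(len(model_ids), len(test_ids), needed, len(model_ids) >= needed)
```

The check at the end only confirms that the sales left for calibration meet the heuristic minimum; it says nothing about whether the model itself is well specified.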

Since the midsize700.sav database contained 54 original variables, the desired minimum number of sales for calibration is 270 (54 variables × 5 sales each). We will hold out 200 sales for the test database and leave 500 sales for calibration (the model building database). Complete the following steps.

- Open the midsize700.sav database and sort the data by RANDOM. Note: the random variable has been previously created in PASW/SPSS to ensure the data is in random order. Select Data → Sort Cases → choose Random from the variable list → choose Ascending → OK.
- Save the current database, then use your mouse to highlight rows 501 to 700. Delete these and save the remainder as midsizemodel.sav using File → Save As. Caution: remember to use the Save As function rather than Save, to avoid overwriting the midsize700 database. This model database will be used to calibrate the model and for initial testing.
- Reopen the midsize700.sav database and highlight rows 1 to 500, delete these, and save the remaining 200 sales as midsizetest.sav. Caution: remember to use the Save As function. This test database will be used for model testing, using sales not used in calibrating the model.

Step 7: Model Calibration

We will now begin to calibrate the model using stepwise regression on the model database. The mechanics of stepwise regression are discussed in Chapter 9 of the Advanced Computer Assisted Mass Appraisal text. Our goal is to progressively eliminate any model variable which is not significant; in other words, to remove variables that do not add to the model's predictive power.

- Open the midsizemodel database.
- Select Analyze → Regression → Linear. Under Method, select Stepwise.
- Select the final group of variables from Step 6 above: Dependent variable Adj_Price; Independent variables adjlotsz, slab, poolyes, partbsmt, Fireplcs, crawl, Bedrooms, effage, outbldgbin, story15, story20, corner, linfinarea, DECKS, GARAGES.
- Click Options.
Under Stepping Method Criteria, Use Probability of F, set Entry to 0.30 and Removal to 0.35. This sets the entry and removal thresholds higher than the PASW/SPSS default limits. This is a less restrictive setting, meaning it allows more variables into the model. The Entry value blocks a variable from entering the model when its Sig. value is greater than 0.30. Once a variable is in the model, its Sig. value can change as other variables enter; if the Sig. value of a variable in the model increases beyond the Removal value of 0.35, that variable will be removed from the model.
- Click Statistics. Select Estimates, Model fit, and Collinearity diagnostics.
- Run the regression.

We will reproduce only the model summary and the final steps for ANOVA and Coefficients.
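The entry and removal rule described above can be sketched as a single decision function. This is a simplified illustration of the logic only, not PASW/SPSS's implementation: the function name is hypothetical, the Sig. values passed in are assumed to have been computed elsewhere, and the exact handling of values equal to a threshold is an assumption.

```python
ENTRY = 0.30    # Entry threshold used in this lesson
REMOVAL = 0.35  # Removal threshold used in this lesson


def next_step(sig_in_model, sig_candidates, entry=ENTRY, removal=REMOVAL):
    """Decide the next stepwise action from current Sig. values.

    sig_in_model:   {variable: Sig.} for variables already in the model
    sig_candidates: {variable: Sig.} each candidate would have if entered
    Returns ("remove", name), ("enter", name) or ("stop", None).
    """
    # First look for a variable whose Sig. has drifted above the removal value.
    if sig_in_model:
        worst = max(sig_in_model, key=sig_in_model.get)
        if sig_in_model[worst] > removal:
            return ("remove", worst)
    # Otherwise admit the most significant candidate, if it passes the entry value.
    if sig_candidates:
        best = min(sig_candidates, key=sig_candidates.get)
        if sig_candidates[best] <= entry:
            return ("enter", best)
    return ("stop", None)


# Illustrative values only: the in-model Sig. figures are made up.
print(next_step({"linfinarea": 0.000, "effage": 0.001},
                {"corner": 0.335, "decks": 0.808}))
```

With these inputs neither candidate passes the 0.30 entry value, so the function reports that stepping stops.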

Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 .797(a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l)
a Predictors: (Constant), linfinarea
b Predictors: (Constant), linfinarea, adjlotsize
c Predictors: (Constant), linfinarea, adjlotsize, effage
d Predictors: (Constant), linfinarea, adjlotsize, effage, story20
e Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl
f Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab
g Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes
h Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes, Fireplcs
i Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes, Fireplcs, story15
j Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes, Fireplcs, story15, Bedrooms
k Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes, Fireplcs, story15, Bedrooms, GARAGES
l Predictors: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes, Fireplcs, story15, Bedrooms, GARAGES, partbsmt

In the Model Summary, notice that R² and adjusted R² increase and the SEE decreases as the steps progress. This shows that all variables in the model tend to improve the result, although the benefit of additional variables drops off significantly after step 9, as Bedrooms, GARAGES, and partbsmt are added.

ANOVA(m)
Model | Sum of Squares | df | Mean Square | F | Sig.
12 Regression (l)
Residual
Total
l Predictors: (Constant), linfinarea, ADJLOTSZ, effage, story20, crawl, slab, poolyes, Fireplcs, story15, Bedrooms, GARAGES, partbsmt
m Dependent Variable: Adj_Price

Coefficients(a)
Columns: Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig. | Collinearity Statistics (Tolerance, VIF)
Rows: (Constant), linfinarea, adjlotsize, effage, story20, crawl, slab, poolyes, Fireplcs, story15, Bedrooms, GARAGES, partbsmt
a Dependent Variable: Adj_Price

Additional details revealed in the Coefficients table above are:
- the DECKS and Outbldgbin variables were excluded, as expected, but corner was also excluded at this level. As the Beta value for corner is low and the Sig. value is .335, we will not attempt to force this variable into the model;
- Sig. values have improved for partbsmt;
- no variables fell short of the desired 90% confidence level (that is, none had a Sig. of 0.10 or greater);
- effage still shows a VIF value above the desired maximum, which corresponds to a tolerance of 30% (VIF = 1/tolerance, or about 3.33), but the Beta value indicates that roughly 26% of the value is explained by this variable, far too high for exclusion; and
- linfinarea, similar to effage, has a high VIF score and a high Beta.

In the above test we found one statistic for effage which suggested removal from the model and another which indicated the variable should be retained. This is a common problem faced by modelers. It is important to make a conscious decision as to which statistic should be emphasized when the results point to conflicting conclusions.

The final step in model calibration is to identify outliers, or sales which have high residual values (the residual is the difference between the predicted value and the actual sale price of each record in the sales database). Our threshold for outliers will be any sales with residual values which lie outside ±3 standard deviations from the mean predicted value. Our strategy will be to "prune" or remove these sales from the model. There is no generally accepted threshold for what is considered an outlier and which sales should be eliminated; this is a decision of the modeler, depending on the circumstances.
For example, with a large database, outliers may have little impact on model predictability and can possibly be ignored.

To identify outliers we re-run our model using the variables identified by stepwise regression, but with the Method set to Enter and with the Casewise Diagnostics report selected:
- Select Analyze → Regression → Linear. Change Method to Enter. Remove Outbldgbin, decks, and corner from the list of independent variables.
- Select Statistics → Casewise Diagnostics and set Outliers outside 3 standard deviations → Continue.

- Click Save and select Standardized under the Residuals heading. Continue → OK.

The model summary and other reports will be the same as above since no variables have changed. Our interest is the Casewise Diagnostics report, which shows 5 sales with high residual values.

Casewise Diagnostics a
Case Number | Std. Residual | Adj_Price | Predicted Value | Residual
a. Dependent Variable: Adj_Price

The next step is to set a filter to eliminate these sales from the calibration process.
- Data → Select Cases. Select If condition is satisfied → If... Set the filter to ABS(ZRE_1) < 3. Continue → OK, then re-run the regression.

Model Summary b
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
a. Predictors: (Constant), garages, slab, poolyes, partbsmt, adjlotsize, story15, Bedrooms, Fireplcs, crawl, story20, effage, linfinarea
b. Dependent Variable: Adj_Price

ANOVA b
Model | Sum of Squares | df | Mean Square | F | Sig.
1 Regression 2.987E E a
Residual 6.696E E9
Total 3.656E
a. Predictors: (Constant), garages, slab, poolyes, partbsmt, adjlotsize, story15, Bedrooms, Fireplcs, crawl, story20, effage, linfinarea
b. Dependent Variable: Adj_Price

Coefficients a
Columns: Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig. | Collinearity Statistics (Tolerance, VIF)
Model 1 rows: (Constant), adjlotsize, slab, partbsmt, crawl, poolyes, Fireplcs, Bedrooms, effage, story, story, linfinarea, garages
a. Dependent Variable: Adj_Price

Residuals Statistics a
Columns: Minimum | Maximum | Mean | Std. Deviation | N
Rows: Predicted Value, Residual, Std. Predicted Value, Std. Residual
a. Dependent Variable: Adj_Price

The Model Summary shows that R² and the SEE are improved with the removal of the five outliers. The residual statistics show the standardized residuals now fall within our objective of ±3 standard deviations, with a maximum of 2.913. This completes the calibration process.

In order to test the model, predicted values must be calculated. PASW/SPSS provides a feature to save the values generated by the model, but this only provides values for the sales used in the final calibration. As outliers in the regression may not be outliers in the valuation process, it is normal to apply the model to all sales in the model database. We will calculate values with the transformation below. This will also be useful in applying the model to the other sale databases.

COMPUTE AMRAVAL = * poolyes * adjlotsize * slab * partbsmt * garages *story * Bedrooms * Fireplcs * crawl * story * effage * linfinarea.
COMPUTE amraasr = amraval / adj_price.

Note: you can double-check your transformation for AMRAVAL using Descriptive Statistics. The mean of AMRAVAL should be very close to the mean of Adj_Price (the dependent variable).

Completing these transformations allows you to proceed with the initial testing of the model.
- Remove the filter: Data → Select Cases → All Cases.
- Save your syntax file as "midsize.sps", as you will need it later in the lesson.
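The two transformations above amount to an additive formula (constant plus coefficient times attribute, summed over the model variables) followed by a ratio of predicted value to adjusted sale price. A minimal Python sketch of that arithmetic follows. The constant, coefficients, and sales are hypothetical placeholders, not the calibrated values from this lesson, and the helper names are assumptions. The sketch also summarises the ratios with the median and the coefficient of dispersion (average absolute deviation from the median, divided by the median).

```python
from statistics import median


def predict_value(sale, constant, coefficients):
    """Additive model: constant plus coefficient * attribute for each variable."""
    return constant + sum(b * sale[name] for name, b in coefficients.items())


def ratio_summary(ratios):
    """Return (median ratio, COD as a fraction of the median)."""
    med = median(ratios)
    avg_abs_dev = sum(abs(r - med) for r in ratios) / len(ratios)
    return med, avg_abs_dev / med


# Hypothetical constant, coefficients and sales, for illustration only.
CONSTANT = 50000.0
COEFFICIENTS = {"linfinarea": 100.0, "poolyes": 8000.0}
SALES = [
    {"linfinarea": 1500.0, "poolyes": 0, "adj_price": 200000.0},
    {"linfinarea": 2000.0, "poolyes": 1, "adj_price": 240000.0},
    {"linfinarea": 1200.0, "poolyes": 0, "adj_price": 200000.0},
]

# Counterpart of COMPUTE AMRAVAL, applied to every sale (no outlier filter).
amraval = [predict_value(s, CONSTANT, COEFFICIENTS) for s in SALES]
# Counterpart of COMPUTE amraasr: predicted value over adjusted sale price.
amraasr = [v / s["adj_price"] for v, s in zip(amraval, SALES)]

med, cod = ratio_summary(amraasr)
print(f"median ratio = {med:.3f}, COD = {cod:.3f}")
```

Because the formula is applied to every sale rather than only the filtered ones, sales pruned as regression outliers still receive a predicted value and a ratio.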

Step 8: Model Testing

We will do our initial testing using the model database, then later test results in the holdout sample. Our first tests are the overall valuation level and dispersion. Remember to change the filter back to Select All Cases. We will use Ratio Statistics with AMRAVAL as the Numerator and Adj_Price as the Denominator. Under Statistics, select mean, median, confidence intervals (95%), minimum, maximum, and COD, and uncheck any others.

Ratio Statistics for AMRAVAL / Adj_Price
Mean
95% Confidence Interval for Mean: Lower Bound .998, Upper Bound
Median
95% Confidence Interval for Median: Lower Bound .988, Upper Bound, Actual Coverage 95.6%
Minimum .704
Maximum
Coefficient of Dispersion .064
The confidence interval for the median is constructed without any distribution assumptions. The actual coverage level may be greater than the specified level. Other confidence intervals are constructed by assuming a Normal distribution for the ratios.

The Ratio Statistics show the mean and median are very near 1.00, and the confidence intervals for both statistics include the target of 1.00. The COD (.064, or 6.4%) is within the IAAO standard of 5 to 10% for homogeneous areas. While this outcome is acceptable, we will continue testing the impact of individual variables not included in the model.

Neighbourhood Adjustments

As indicated earlier, Nbhd was not part of the model process, so it is necessary to test that all neighbourhoods are valued at the same level. In Lesson 2, we learned that the Kruskal-Wallis (K-W) test will tell us whether the property groups, in this case neighbourhoods, have the same level of assessment. We are looking for two outcomes of this test to confirm that the level of assessment is similar in all neighbourhoods:
1. The expected value of the mean ranks for each neighbourhood should be in the centre of the distribution. The test assigns a rank to each observation in order from the least value to the greatest value.
The sum of the ranks is calculated for each group and the mean of the ranks determined. The mean ranks for each neighbourhood should be approximately equal to half the number of sales (the model database holds 500 sales).
2. The chi-square statistic can be used to approximate the significance of the K-W test, given the confidence level and degrees of freedom. In this case, our target is a 95% confidence level and the degrees of freedom statistic is 2. The Asymp. Sig. is the probability associated with the calculated chi-square statistic. If the Sig. is above the 5% threshold, we do not reject the null hypothesis that the level of assessment is similar for all neighbourhoods.

To run a K-W test, complete the following PASW/SPSS commands:
- Analyze → Nonparametric tests → K Independent Samples.
- Select AMRAASR as the Test Variable and NBHD as the Grouping Variable.
- Click Define Range and enter 36 and 46 as the Minimum and Maximum. Continue → ensure Kruskal-Wallis H is checked → OK to run.
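The mechanics behind the two K-W outcomes can be sketched in Python: rank all ratios together, average the ranks within each neighbourhood, compute the H statistic, and convert it to a significance value. This is an illustrative sketch rather than the PASW/SPSS procedure. The ratios and the three neighbourhood codes are made up, tied values receive average ranks but H is not tie-corrected, and the significance shortcut is valid only for 2 degrees of freedom (three groups), where the chi-square upper-tail probability is exactly exp(-H/2).

```python
import math


def kruskal_wallis(groups):
    """groups: {label: [values]}. Returns ({label: mean rank}, H statistic)."""
    pooled = sorted(
        ((v, g) for g, vals in groups.items() for v in vals),
        key=lambda pair: pair[0],
    )
    n = len(pooled)
    rank_sums = {g: 0.0 for g in groups}
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += avg_rank
        i = j + 1
    mean_ranks = {g: rank_sums[g] / len(groups[g]) for g in groups}
    h = 12 / (n * (n + 1)) * sum(
        rank_sums[g] ** 2 / len(groups[g]) for g in groups
    ) - 3 * (n + 1)
    return mean_ranks, h


def sig_two_df(h):
    """Chi-square upper-tail probability for exactly 2 degrees of freedom."""
    return math.exp(-h / 2)


# Made-up ratios for three hypothetical neighbourhood codes.
ratios_by_nbhd = {
    36: [0.94, 1.00, 1.06],
    41: [0.96, 1.02, 1.08],
    46: [0.98, 1.04, 1.10],
}
mean_ranks, h = kruskal_wallis(ratios_by_nbhd)
sig = sig_two_df(h)
print(mean_ranks, round(h, 3), round(sig, 3), sig > 0.05)
```

Under the null hypothesis each group's expected mean rank is (N + 1) / 2, which for a large sample is close to half the number of sales. A Sig. above 0.05 means the sketch finds no evidence that the groups differ in level.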
