Stat 301 Final Exam December 20, 2013


Name:

INSTRUCTIONS: Read the questions carefully and completely. Answer each question and show work in the space provided. Partial credit will not be given if work is not shown. Use the JMP output. It is not necessary to calculate something by hand that JMP has already calculated for you. When asked to explain, describe, or comment, do so within the context of the problem. Be sure to include units when discussing quantitative variables.

1. [14 pts] This question deals with the three model selection procedures (Forward, Backward and Mixed) in general.
a) [3] Explain what the Prob to Enter is.
b) [4] For the Forward selection procedure, what is the first variable that will enter the model?
c) [3] Explain what the Prob to Leave is.
d) [4] For the Backward selection procedure, what is the first variable that will be removed from the model?
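The first step of Forward selection in question 1 can be sketched in code. This is a minimal illustration, assuming the usual convention that the candidate with the smallest P-value enters first, provided that P-value is below Prob to Enter. Because every one-variable model has the same degrees of freedom (1 and n - 2), the smallest P-value belongs to the largest F ratio, so the sketch compares F ratios and needs no distribution tables. The data and the names `x1`, `x2`, `simple_f_ratio` and `first_to_enter` are hypothetical and are not taken from the exam.

```python
from statistics import mean


def simple_f_ratio(x, y):
    """F ratio for the one-variable regression of y on x (df = 1 and n - 2)."""
    n = len(y)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)  # R-square of the one-variable model
    return r2 * (n - 2) / (1 - r2)


def first_to_enter(candidates, y):
    """Return the candidate with the largest F ratio, plus all F ratios."""
    f_ratios = {name: simple_f_ratio(x, y) for name, x in candidates.items()}
    best = max(f_ratios, key=f_ratios.get)
    return best, f_ratios


# Hypothetical toy data, for illustration only.
y = [10.0, 12.5, 13.0, 17.5, 19.0, 23.0]
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
best, f_ratios = first_to_enter(candidates, y)
print(best, f_ratios)
```

Backward selection mirrors this: starting from the full model, the term with the largest P-value is the first candidate to leave, and it is removed only if that P-value exceeds Prob to Leave.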

2. [26 pts] On the first two labs this semester we looked at the concentration of Non-Structural Carbohydrates (NSC in mg/g) for trees and shrubs in dry and moist tropical forests. Researchers are interested in using the NSC concentrations to predict the concentration of sugar (in mg/g) for those tropical trees and shrubs. For the multiple regression model the centered value of NSC, (NSC ), is used. The Forest Indicator is 0 if it is a Moist tropical forest and 1 if it is a Dry tropical forest.

Summary of Fit
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)   87

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model
Error                                          Prob > F
C. Total                                       <.0001*

Parameter Estimates
Term                      Estimate   Std Error   t Ratio   Prob>|t|
Intercept                                                  <.0001*
(NSC )                                                     <.0001*
Forest Indicator                                           *
Forest Indicator*(NSC )                                    *

a) [2] How much variation in sugar concentration can be explained by the model with (NSC ), Forest Indicator and the interaction between these two variables?

b) [4] Is the model useful? Support your answer with the value of the appropriate test statistic, P-value and conclusion based on the P-value.
c) [4] Give an interpretation of the estimated intercept within the context of the problem.
d) [4] Give an interpretation of the parameter estimate for (NSC ) within the context of the problem.
e) [4] Give an interpretation of the parameter estimate for the Forest Indicator variable within the context of the problem.
f) [4] Because the P-value for the parameter estimate for Forest Indicator*(NSC ) is so small, the Forest Indicator*(NSC ) term is statistically significant. What does this indicate about the relationship between sugar concentration and NSC concentration?
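For parts c) through f), it may help to write the fitted model out separately for each forest type. The sketch below uses symbols that are not in the exam: c stands for the centering constant (not legible in this transcription), I for the Forest Indicator, and b_0 through b_3 for the four parameter estimates in the order JMP lists them.

```latex
\begin{aligned}
\widehat{\text{Sugar}} &= b_0 + b_1(\text{NSC}-c) + b_2 I + b_3\, I\,(\text{NSC}-c) \\
\text{Moist } (I=0):\quad \widehat{\text{Sugar}} &= b_0 + b_1(\text{NSC}-c) \\
\text{Dry } (I=1):\quad \widehat{\text{Sugar}} &= (b_0+b_2) + (b_1+b_3)(\text{NSC}-c)
\end{aligned}
```

Read this way, b_2 is the Dry minus Moist difference in predicted sugar at NSC = c, and b_3 is the Dry minus Moist difference in slopes.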

g) [4] Is there a statistically significant difference in average sugar concentrations for tropical trees and shrubs with mg/g NSC in Moist and Dry tropical forests? Support your answer with the value of the appropriate test statistic, P-value and conclusion based on the P-value.

3. [37 pts] Standardized residuals, leverage (h) values, Cook's D, and Studentized residuals are calculated for each of the n = 87 species using the fitted model in problem 2.
a) [5] The distribution of standardized residuals is given below. What does this indicate about the normal distribution condition? Be sure to make specific reference to the plots to support your answer.

[Figure: distribution of Standardized Residuals]

b) [3] The plot of standardized residuals versus the type of forest is given below. What does this plot tell you about the conditions necessary for statistical inference?

Level   n   Mean   Std Dev
Dry
Moist

c) [5] The four most extreme standardized residuals are given below. Which are statistically significant? Explain briefly.

Species                 Forest   Sugar   NSC   Std Resid, z   P-value, z
Ocotea sp. 1            Moist
Pseudolmedia laevis     Moist
Zeyheria tuberculosa    Dry
Pouteria macrophylla    Moist

d) [3] How large does the leverage, h, have to be to be considered high leverage?

The ten largest leverage values are given below. The average NSC for trees and shrubs in Dry tropical forests is 56.8 mg/g while the average NSC for trees and shrubs in Moist tropical forests is 72.9 mg/g. The average sugar concentration for Dry tropical forest trees and shrubs is 27.7 mg/g while the average sugar concentration for Moist tropical forest trees and shrubs is 30.3 mg/g.

Species                     Forest   Sugar   NSC   h Sugar   F   P-value, F
Ocotea sp. 1                Moist
Ocotea sp. 2                Moist
Pourouma cecropiifolia      Moist
Pouteria nemorosa           Moist
Swietenia macrophylla       Moist
Bougainvillea modesta       Dry
Chrisyophyllum gonocarpon   Dry
Neea cf. steimbachii        Dry
Pouteria gardneriana        Dry
Zeyheria tuberculosa        Dry

e) [5] Calculate the F statistic for Ocotea sp. 1.
f) [4] What species of tropical trees and shrubs have statistically significant leverage values? Explain briefly.
g) [3] What is the reason for the statistically significant leverage?
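The leverage calculations in parts d) and e) can be sketched as follows. Two assumptions are made here that the exam text does not confirm: the high-leverage cutoff is taken to be the common rule of thumb 2(k + 1)/n, and the F statistic is the standard transformation of leverage that follows an F distribution with k and n - k - 1 degrees of freedom when the explanatory variables are multivariate normal. The values n = 87 and k = 3 come from problem 2; the leverage value `h` is hypothetical, as are the function names.

```python
def high_leverage_cutoff(n, k):
    """Rule-of-thumb cutoff (assumed): twice the average leverage (k + 1) / n."""
    return 2 * (k + 1) / n


def leverage_f(h, n, k):
    """F statistic for a leverage value h, with k and n - k - 1 df (assumed form)."""
    return ((h - 1 / n) / k) / ((1 - h) / (n - k - 1))


n, k = 87, 3   # 87 species; three terms in the problem 2 model
h = 0.20       # hypothetical leverage value, not taken from the exam table

cutoff = high_leverage_cutoff(n, k)
f_stat = leverage_f(h, n, k)
print(f"cutoff = {cutoff:.4f}, F = {f_stat:.4f}")
```

A P-value for the F statistic would then come from the upper tail of the F distribution with those degrees of freedom, which is the quantity the "P-value, F" column reports.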

The 5 trees and shrubs with either the largest values of Cook's D or the most extreme Studentized residuals are given below.

Species                 Forest   Cook's D   Studentized Resid, t   P-value, t
Ocotea sp. 1            Moist
Pouteria macrophylla    Moist
Pseudolmedia laevis     Moist
Pouteria gardneriana    Dry
Zeyheria tuberculosa    Dry

h) [2] Which tree/shrub species has the largest value of Cook's D? Give the name of the tree/shrub species and the value of Cook's D.
i) [2] Is this considered a highly influential value? Explain briefly.
j) [2] Which tree/shrub species has the most extreme Studentized residual? Give the name of the tree/shrub species and the value of the Studentized residual.
k) [3] Does the tree/shrub species with the most extreme Studentized residual have statistically significant influence? Support your answer.
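The quantities in parts h) through k) are linked by standard formulas, sketched below. The sketch assumes that the exam's standardized residual z is the internally studentized residual and that its Studentized residual t is the externally studentized (deleted) residual; the exam text does not state this. The inputs `z` and `h` are hypothetical, and the function names are illustrative only.

```python
from math import sqrt


def cooks_d(z, h, k):
    """Cook's D from a standardized residual z and leverage h, with k terms."""
    return (z ** 2 / (k + 1)) * (h / (1 - h))


def studentized_from_standardized(z, n, k):
    """Externally studentized residual t implied by z (df = n - k - 2)."""
    return z * sqrt((n - k - 2) / (n - k - 1 - z ** 2))


n, k = 87, 3       # from problem 2
z, h = 2.0, 0.20   # hypothetical values, not taken from the exam table

d = cooks_d(z, h, k)
t = studentized_from_standardized(z, n, k)
print(f"Cook's D = {d:.4f}, t = {t:.4f}")
```

A commonly quoted guideline treats Cook's D above 1 as highly influential; whether the course uses that cutoff is an assumption here.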

4. [48 pts] A random sample of 100 houses was selected from all houses sold in Ames in . We are interested in building a model for Sales Price ($1000) based on characteristics of the houses. There are 12 explanatory variables for each house.

Lot Area            Area of the lot in square feet
Quality             Index of quality, 0 = low, 10 = high
Condition           Index of condition, 0 = poor, 10 = excellent
Age                 Age of house at the time of sale
Basement Area       Area of the basement in square feet
First Floor Area    Area of the first floor in square feet
Second Floor Area   Area of the second floor in square feet
Baths               Number of baths (note: a half bath = ½)
Bedrooms            Number of bedrooms
Other Rooms         Number of other rooms
Fireplaces          Number of fireplaces
Garage Cars         Size of garage in terms of number of cars

Below is the output for the Forward selection procedure with Prob to Enter = . The C. Total sum of squares is

a) [5] What is the best single variable model for predicting Sales Price? How do you know this is the best single variable model? How much variability in Sales Price does this single variable explain?
b) [5] What variable was added to the model on the fourth step of the forward selection procedure? What was the P-value for this variable when it was entered? How much additional variability in Sales Price does adding this variable explain, given the variables entered in the first three steps?
c) [4] Could the 10 variable model (Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms, Other Rooms, and Garage Cars) be the best model? Explain briefly.

Below is the initial setup for the Backward selection procedure with Prob to Leave = .

d) [5] What will be the first variable removed from the full model using the backward selection procedure? Give the P-value associated with this action and indicate how the value of R² will change and by how much when that variable is removed.
e) [5] After the first two variables are removed by the Backward selection procedure, the Current Estimates are the same as those given at the end of the Forward selection procedure (see the Forward selection output at the start of question 4). What will happen at the next step of the Backward procedure? Explain what will happen and why.

Running the Mixed procedure with Prob to Enter = Prob to Leave = 0.05 gives the Current Estimates given below.

f) [5] Complete the analysis of variance table that corresponds to the model with the terms checked as Entered (Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms and Garage Cars).

Source     df   Sum of Squares   Mean Square   F   Prob > F
Model                                              <
Error
C. Total

g) [4] Could the model with the terms checked as Entered be the best model? Explain briefly.
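The bookkeeping behind part f) can be sketched as below. The sample size n = 100 and the k = 9 entered terms come from the exam; the two sums of squares are made up for illustration, since the JMP values are not legible in this transcription. The function name `anova_table` is hypothetical.

```python
def anova_table(n, k, ss_total, ss_error):
    """Fill in an ANOVA table for a regression with k terms and n observations."""
    ss_model = ss_total - ss_error          # sums of squares add up
    df_model, df_error, df_total = k, n - k - 1, n - 1
    ms_model = ss_model / df_model          # mean square = SS / df
    ms_error = ss_error / df_error
    return {
        "Model": {"df": df_model, "SS": ss_model, "MS": ms_model,
                  "F": ms_model / ms_error},
        "Error": {"df": df_error, "SS": ss_error, "MS": ms_error},
        "C. Total": {"df": df_total, "SS": ss_total},
    }


# n and k are from the exam; the sums of squares are hypothetical.
table = anova_table(n=100, k=9, ss_total=1000.0, ss_error=100.0)
for source, row in table.items():
    print(source, row)
```

The same identities also give R² as Model SS divided by C. Total SS, and RMSE as the square root of the Error mean square.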

The All Possible Models option in Stepwise was used to display the top 10 (according to R²) models for 9, 10, and 11 variable models along with the full 12 variable model. Below are summary values of various measures of how well the model fits the data. Use this output when answering parts h) through l).

Model   # variables   RSquare   RMSE   AICc   BIC   Cp

h) [2] Which model has the best R² value? Identify by giving the model number, number of variables and value of R².

i) [2] Which model has the best RMSE value? Identify by giving the model number, number of variables and value of RMSE.
j) [2] Which model has the best Cp value? Identify by giving the model number, number of variables and value of Cp.
k) [2] Which model has the best AICc value? Identify by giving the model number, number of variables and value of AICc.
l) [2] Which model has the best BIC value? Identify by giving the model number, number of variables and value of BIC.
m) [5] Model 1 includes variables Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms and Garage Cars. Model 11 includes variables Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms, Other Rooms and Garage Cars. Between these two models, which is best according to the definition of best we have been using in this course? Explain briefly.
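Two of the fit measures in parts i) and j) can be computed directly from error sums of squares, as sketched below. The sketch assumes the usual definitions: RMSE is the square root of SSE divided by its degrees of freedom, and Mallows' Cp is SSE_p / MSE_full - n + 2p, where p counts the coefficients including the intercept. The sample size n = 100 and the 12-variable full model come from the exam; both SSE values are hypothetical.

```python
from math import sqrt


def rmse(sse, n, p):
    """Root mean square error for a model with p coefficients (incl. intercept)."""
    return sqrt(sse / (n - p))


def mallows_cp(sse_p, p, mse_full, n):
    """Mallows' Cp for a submodel with p coefficients (incl. intercept)."""
    return sse_p / mse_full - n + 2 * p


n = 100
p_full = 12 + 1        # full model: 12 variables plus the intercept
sse_full = 8700.0      # hypothetical SSE for the full model
mse_full = sse_full / (n - p_full)

p_sub = 9 + 1          # a 9-variable model plus the intercept
sse_sub = 9100.0       # hypothetical SSE for the 9-variable model

cp_sub = mallows_cp(sse_sub, p_sub, mse_full, n)
cp_full = mallows_cp(sse_full, p_full, mse_full, n)
print(f"Cp (9 variables) = {cp_sub:.2f}, Cp (full) = {cp_full:.2f}")
print(f"RMSE (9 variables) = {rmse(sse_sub, n, p_sub):.3f}")
```

By construction the full model's Cp equals its own p, so Cp is informative only for comparing submodels: values near or below p suggest little bias from the omitted variables.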