Master of Supply Chain, Transport and Mobility
Data Analysis on Transport and Logistics - Course Partial Exam
Lecturer: Lidia Montero
November 10th, 2016

Problem 1: All questions account for 1 point

Dataset: usair

Data from a dataset of air pollution in US cities. Seven variables were recorded for 41 cities:

SO2: Sulphur dioxide content of air in micrograms per cubic meter
Neg.Temp: Negative of the average annual temperature in degrees Fahrenheit (temperature multiplied by -1)
Manuf: Number of manufacturing enterprises employing 20 or more workers
Pop: Population size (1970 census) in thousands
Wind: Average annual wind speed in miles per hour
Precip: Average annual precipitation in inches
Days: Average number of days with precipitation per year

Source: Everitt, B.S. (2005), An R and S-PLUS Companion to Multivariate Analysis, Springer.

Load the usair.rdata file in your current R or RStudio session.

1. Pop contains the population of each city in thousands of inhabitants. Create a new factor variable, named f.size, that indicates whether a city is small, medium or large. Small cities have fewer than half a million inhabitants, medium cities have between half a million and one and a half million, and large cities have more than one and a half million.

    library(car)
    library(FactoMineR)
    load("usair.rdata")
    summary(usair)

Recoverable values from the summary(usair) output:

    SO2:    Min. 8.00
    Manuf:  Min. 35.0
    Precip: Min. 7.05, Median 38.74, Mean 36.77, Max. 59.80
    Days:   1st Qu. 103.0, Median 115.0, 3rd Qu. 128.0

    usair$f.size <- factor(cut(usair$Pop, breaks = c(0, 500, 1500, 3500)),
                           labels = c("Small", "Medium", "Large"))
    table(usair$f.size)                  # number of cities per size group
    table(usair$f.size) / nrow(usair)    # share of cities per size group

2. Our target is defined as SO2. Summarize the response variable numerically and graphically. Interpret the results. Do you think that SO2 may be considered normally distributed?

Graphical inspection through a histogram with a normal curve overlaid shows that the SO2 target is clearly not normally distributed. The Shapiro-Wilk test rejects the normality hypothesis.

    mm <- mean(usair$SO2)
    ssdd <- sd(usair$SO2)
    hist(usair$SO2, freq = FALSE)
    curve(dnorm(x, mm, ssdd), col = "red", lwd = 2, add = TRUE)

[Figure: Histogram of usair$SO2 on the density scale, with the fitted normal curve in red]
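The binning step can be checked on a toy vector. The following is a minimal sketch, assuming hypothetical population values in thousands (the vector `pop` is illustrative and not taken from usair). It shows that cut() uses right-closed intervals by default, so a city with exactly 500 thousand inhabitants falls in the Small group.

```r
# Hypothetical populations in thousands (illustrative, not from usair)
pop <- c(120, 500, 501, 950, 1500, 1501, 3300)

# cut() builds right-closed intervals by default:
# (0,500], (500,1500], (1500,3500]
f_size <- cut(pop, breaks = c(0, 500, 1500, 3500),
              labels = c("Small", "Medium", "Large"))

counts <- table(f_size)                        # cities per group
shares <- round(100 * prop.table(counts), 1)   # percentage per group

print(counts)
print(shares)
```

If the strict reading "fewer than half a million" matters for a boundary case, passing right = FALSE to cut() gives left-closed intervals instead.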

    shapiro.test(usair$SO2)

    Shapiro-Wilk normality test
    data: (usair$SO2)
    W = [missing], p-value = 9.723e[exponent missing]

3. Calculate the upper threshold to identify severe outliers for SO2. Is there any city satisfying this criterion? Are global outliers retained once the city size factor is considered?

The upper threshold for severe outliers lies 3 times the IQR above Q3; according to the summary of SO2, this is 101 micrograms per cubic meter. Chicago is the only severe outlier, and it is also the largest city. Once the target is examined within each group defined by the city size factor, Chicago is no longer an outlier. [city name missing] is a small city with a very high sulphur dioxide content in air, clearly a severe outlier in its group. In the medium size group, three group outliers with high SO2 content appear: Cleveland, [city name missing] and St. Louis.

    par(mfrow = c(1, 2))
    ss <- summary(usair$SO2); ss
    # Upper threshold: Q3 + 3 * (Q3 - Q1)
    utso2 <- ss[5] + 3 * (ss[5] - ss[2]); utso2
    3rd Qu.
        101
    Boxplot(usair$SO2, labels = row.names(usair))
    [1] "Chicago" "Philadelphia" ""
    abline(h = 101, col = "red", lwd = 3)
    Boxplot(usair$SO2 ~ usair$f.size, labels = row.names(usair), col = heat.colors(3))
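The threshold rule used above, Q3 plus three times the interquartile range, can be worked through on a small example. This is a sketch under the assumption of a hypothetical SO2 vector (`so2` is illustrative, not the exam data), using quantile() with its default type.

```r
# Hypothetical SO2 readings (illustrative, not from usair)
so2 <- c(8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 110)

q1  <- unname(quantile(so2, 0.25))   # first quartile: 13
q3  <- unname(quantile(so2, 0.75))   # third quartile: 23
iqr <- q3 - q1                       # interquartile range: 10

upper_mild   <- q3 + 1.5 * iqr       # usual boxplot whisker limit: 38
upper_severe <- q3 + 3 * iqr         # severe outlier limit: 53

# Observations above the severe threshold
severe <- so2[so2 > upper_severe]
print(upper_severe)
print(severe)
```

Only the value 110 exceeds the severe threshold of 53 in this toy vector. The exam code reads the quartiles from summary(), which rounds to four significant digits, so the two approaches can differ slightly in the last digit.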

[Figure: boxplot of usair$SO2 with Chicago and Philadelphia labelled, next to boxplots of usair$SO2 by usair$f.size (Small, Medium, Large) with Cleveland and St. Louis labelled]

    [1] "" "St. Louis" "Cleveland" ""

4. Which numerical variables are statistically associated with the response (SO2)? Indicate the suitable measure of association and/or the tests that support your answer. Assess the linearity of the association with SO2 for the available variables.

Spearman correlations between the target and the rest of the variables have to be calculated. The most associated variables are Neg.Temp (sign-reversed average annual temperature) and Days (average number of days with rain or snow). The null hypothesis of no association is tested for each of the six pairs with cor.test(); p-values under 0.05 are found for Neg.Temp, Manuf, Pop and Days, so the association is statistically significant for these four variables.

    round(cor(usair[, 1:7], method = "spearman"), dig = 3)

[Output: 7 x 7 Spearman correlation matrix for SO2, Neg.Temp, Manuf, Pop, Wind, Precip and Days; values missing]

    cortest <- rep(0, 6)
    for (j in 2:7) {
      # The argument is method=, not type=; an unknown type= is silently
      # ignored and the default Pearson test is run instead
      cortest[j - 1] <- cor.test(usair[, 1], usair[, j], method = "spearman")$p.value
    }
    names(cortest) <- names(usair)[2:7]
    sort(cortest)

[Output: p-values sorted in increasing order: Manuf, Pop, Neg.Temp, Days, Wind (order 1e-01), Precip (order 1e-01); mantissas missing]

    plot(usair[, 1:7])

[Figure: scatterplot matrix of SO2, Neg.Temp, Manuf, Pop, Wind, Precip and Days]

5. Describe the profile of the SO2 numeric target using the tools available in the FactoMineR package.

An alternative answer can be given with the condes() procedure in FactoMineR. The target is the first variable in the dataset. The global association with the target is significant for Manuf, Pop, Neg.Temp and Days. The global association between the SO2 target and the city size factor is also significant. In the category output, the mean of SO2 for the group of Large cities (3 observations) is significantly higher than the grand mean, by 29.7 units.

    library(FactoMineR)
    condes(usair, 1)

[Output: $quanti lists correlation and p.value for Manuf (p-value of order 1e-06), Pop (1e-03), Neg.Temp (1e-03) and Days (1e-02); correlations and mantissas missing]

    $quali

[Output: $quali reports R2 and p.value for f.size; $category reports Estimate and p.value for Large; values missing]

6. Can the average SO2 in the cities be argued to be the same for all city size levels (f.size)? Which groups show a significantly greater average SO2 than the others?

A non-parametric test comparing SO2 across the levels of f.size has to be used. The null hypothesis of equal SO2 in all groups has a p-value of [missing] according to the Kruskal-Wallis test. It is borderline, but since the sample size is small we have to be cautious and reject the null hypothesis: at least one group differs from the others. With the pairwise Wilcoxon tests, the hypotheses Mean SO2(Large) = Mean SO2(Small) and Mean SO2(Large) = Mean SO2(Medium) are rejected, while the means for Small and Medium cities are clearly equal. The mean sulphur dioxide content in air for large cities is therefore different from (and higher than) that for small and medium cities.

    Boxplot(SO2 ~ f.size, data = usair)

[Figure: boxplots of SO2 by f.size (Small, Medium, Large), with Cleveland and St. Louis labelled]

    [1] "" "St. Louis" "Cleveland" ""

    kruskal.test(SO2 ~ f.size, data = usair)

    Kruskal-Wallis rank sum test
    data: SO2 by f.size
    Kruskal-Wallis chi-squared = [missing], df = 2, p-value = [missing]

    oneway.test(SO2 ~ f.size, data = usair)   # Not suitable

    One-way analysis of means (not assuming equal variances)
    data: SO2 and f.size
    F = [missing], num df = [missing], denom df = [missing], p-value = [missing]

    with(usair, pairwise.wilcox.test(SO2, f.size))

    Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with ties
    (this warning is printed three times, once per pair of groups)

    Pairwise comparisons using Wilcoxon rank sum test
    data: SO2 and f.size
    [p-value table with rows Medium, Large and columns Small, Medium; values missing]
    P value adjustment method: holm

    with(usair, pairwise.t.test(SO2, f.size))

    Pairwise comparisons using t tests with pooled SD
    data: SO2 and f.size
    [p-value table with rows Medium, Large and columns Small, Medium; values missing]
    P value adjustment method: holm

7. Can the variance of SO2 in the cities be argued to be the same for all city size levels (f.size)? Which groups are likely to have a greater dispersion of SO2 than the others?

A non-parametric test on the variances of SO2 according to f.size has to be used. The null hypothesis of equal variance of SO2 in all groups has a p-value of [missing] according to the Fligner-Killeen test. We therefore have no evidence to reject the null hypothesis, and it is retained: the dispersion of the target does not depend on the city size group.

    Boxplot(SO2 ~ f.size, data = usair)

[Figure: boxplots of SO2 by f.size (Small, Medium, Large), with Cleveland and St. Louis labelled]

    [1] "" "St. Louis" "Cleveland" ""

    fligner.test(SO2 ~ f.size, data = usair)

    Fligner-Killeen test of homogeneity of variances
    data: SO2 by f.size
    Fligner-Killeen:med chi-squared = [missing], df = 2, p-value = [missing]

    bartlett.test(SO2 ~ f.size, data = usair)   # Not suitable

    Bartlett test of homogeneity of variances
    data: SO2 by f.size
    Bartlett's K-squared = [missing], df = 2, p-value = [missing]
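The non-parametric workflow of questions 6 and 7 can be reproduced end to end on simulated data. The sketch below assumes three hypothetical groups of skewed values (`y` and `grp` are simulated, not the usair columns) and uses only functions from the base stats package: a global Kruskal-Wallis test, Holm-adjusted pairwise Wilcoxon tests, and the Fligner-Killeen test for equal dispersion.

```r
set.seed(1)

# Simulated, right-skewed response in three groups of unequal size
# (illustrative data only; group sizes and rates are assumptions)
grp <- factor(rep(c("Small", "Medium", "Large"), times = c(15, 15, 6)),
              levels = c("Small", "Medium", "Large"))
y <- c(rexp(15, rate = 1 / 20), rexp(15, rate = 1 / 25), rexp(6, rate = 1 / 60))

# Global test: do all groups share the same location?
kw <- kruskal.test(y ~ grp)

# Pairwise follow-up with Holm adjustment (the default method)
pw <- pairwise.wilcox.test(y, grp, p.adjust.method = "holm")

# Robust test of equal dispersion across groups
fk <- fligner.test(y ~ grp)

print(kw$p.value)
print(pw$p.value)   # rows: Medium, Large; columns: Small, Medium
print(fk$p.value)
```

With three groups, both global tests have 2 degrees of freedom, and the pairwise table is lower triangular because each pair is tested once.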

8. Consider a multiple regression model (m1) for the target SO2 on all numeric variables in the dataset. Assess the quality of the model.

The coefficient of determination is 67%, indicating that about two thirds of the variability of the SO2 target is explained by the model (all numeric variables in the dataset are included). Some of the variables have a p-value greater than 0.05 for the test of the null hypothesis that their coefficient equals 0.

    names(usair)
    [1] "SO2" "Neg.Temp" "Manuf" "Pop" "Wind" "Precip"
    [7] "Days" "f.size"

    m1 <- lm(SO2 ~ ., data = usair[, 1:7])
    summary(m1)

    Call:
    lm(formula = SO2 ~ ., data = usair[, 1:7])

    Coefficients (estimates missing; significance codes as printed):
    (Intercept)  *
    Neg.Temp     *
    Manuf        ***
    Pop          *
    Wind
    Precip
    Days
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: [missing] on 34 degrees of freedom
    Multiple R-squared: [missing], Adjusted R-squared: [missing]
    F-statistic: [missing] on 6 and 34 DF, p-value: 5.419e-07

9. Consider model (m1) and check the significance of all variables.

Some variables have a p-value above 0.05 for the test of the null hypothesis that their coefficient equals 0. With so few observations the power of the test is low, but apparently Precip and Days are not significant. If we check the net effects of the variables with the Anova() method from car, which is well suited to net-effect testing, the same conclusions appear: the net effects of Precip and Days, once the rest of the variables are already in the linear predictor, are not significant.

    summary(m1)

    Call:
    lm(formula = SO2 ~ ., data = usair[, 1:7])

    Residuals:

    Min 1Q Median 3Q Max [values missing]

    Coefficients (estimates missing; significance codes as printed):
    (Intercept)  *
    Neg.Temp     *
    Manuf        ***
    Pop          *
    Wind
    Precip
    Days
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: [missing] on 34 degrees of freedom
    Multiple R-squared: [missing], Adjusted R-squared: [missing]
    F-statistic: [missing] on 6 and 34 DF, p-value: 5.419e-07

    Anova(m1)

    Anova Table (Type II tests)
    Response: SO2
    Terms (Sum Sq, Df, F value and Pr(>F) missing; significance codes as printed):
    Neg.Temp     *
    Manuf        ***
    Pop          *
    Wind
    Precip
    Days
    Residuals
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10. Assess the default residual plots in R for model (m1): are there any atypical residuals? Which ones?

The first plot shows the raw residuals against the fitted values; for a valid model it should show only noise. In this case no pattern is present, but there are some large positive residuals ([city names missing]). Normality of the residuals is checked with a Q-Q plot, which shows 2-3 cities that are not on the Q-Q line: again [city names missing], plus a new city, Phoenix. At least for the first two cities, the residuals are too large to follow a normal distribution. On the scale-location plot the smoother line is not flat, indicating non-constant variance, but with 41 observations one has to be cautious. The last plot, at the bottom right, shows a city that is atypical in its leverage, lying far from the multidimensional cloud of points in the design matrix; this does not seem relevant because its residual is close to 0.

    par(mfrow = c(2, 2))
    plot(m1, id.n = 5)

[Figure: default diagnostic plots for m1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; labelled cities include Cincinnati, Buffalo, Milwaukee and Phoenix]

11. For your model, determine the presence of outliers in the residuals. Specify city names, the selected criterion and the behavioral discrepancy.

Atypical residuals appear for [city name missing] and [city name missing], so the lack of fit for these 2 cities is remarkable: the observed SO2 is much greater than the value predicted by model m1. [city name missing] is a medium city with an observed SO2 of 61 micrograms per cubic meter while the model predicts [missing], and [city name missing] is a small city with an observed SO2 of 94 micrograms per cubic meter while the model predicts 45.24, so a large lack of fit is found for these two cities.

    par(mfrow = c(1, 1))
    # Boxplot() from car returns the labels of the outlying points
    llist <- Boxplot(resid(m1), labels = row.names(usair))   # For assessing atypical residuals
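The criterion used here, flagging residuals that a boxplot marks as outliers and then comparing observed and predicted values, can be sketched with base R alone. The example below is a minimal illustration on simulated data with one planted outlier (the data frame `d` and the names `city1` to `city30` are hypothetical); it uses boxplot.stats() in place of car's Boxplot(), which applies the same 1.5 times IQR rule.

```r
set.seed(42)
n <- 30

# Simulated linear relationship (illustrative data, not usair)
x <- runif(n, 0, 10)
y <- 5 + 2 * x + rnorm(n, sd = 1)
y[7] <- y[7] + 15                      # plant one point far above the trend
d <- data.frame(x = x, y = y, row.names = paste0("city", seq_len(n)))

fit <- lm(y ~ x, data = d)
res <- resid(fit)                      # named by the row names of d

# boxplot.stats() returns the values lying beyond 1.5 * IQR from the hinges
out_vals <- boxplot.stats(res)$out
flagged  <- names(res)[res %in% out_vals]

# Observed versus predicted response for the flagged rows
comparison <- data.frame(observed  = d[flagged, "y"],
                         predicted = unname(predict(fit)[flagged]),
                         residual  = unname(res[flagged]),
                         row.names = flagged)
print(comparison)
```

A large positive residual means the model underpredicts the observed value, which is the situation described for the two cities above.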

[Figure: boxplot of resid(m1) with the two atypical residuals labelled]

    llist
    [1] "" ""
    predict(m1)[llist]
    usair[llist, ]

[Output: two rows with columns SO2, Neg.Temp, Manuf, Pop, Wind, Precip, Days and f.size; the first city is Medium and the second is Small; city names and values missing]

12. For your final model, determine the presence of actual influential data. Specify city names, the selected criterion and the behavior.

The boxplot of Cook's distance identifies two cities whose distances lie far from those of the rest of the observations: [city name missing] (SO2 content too high) and Phoenix (low SO2 content, with a negative prediction according to m1). Both observations are influential data that show a lack of fit with the current model m1.

    sort(cooks.distance(m1), decreasing = TRUE)

[Output: cities in decreasing order of Cook's distance; distances missing and two city names missing:
Phoenix, Buffalo, [missing], Cincinnati, Little Rock, [missing], Milwaukee, Minneapolis-St. Paul, Norfolk, Denver, New Orleans, Albany, Philadelphia, Baltimore, Kansas City, Cleveland, Memphis, Dallas, San Francisco, Jacksonville, Alburquerque, Nashville, St. Louis, Des Moines, Indianapolis, Omaha, Washington, Seattle, Charleston, Miami, Richmond, Chicago, Houston, Salt Lake City, Wilmington, Louisville, Atlanta, Hartford, Columbus, Wichita, Detroit]

    # Boxplot() from car returns the labels of the outlying points
    llist <- Boxplot(cooks.distance(m1), labels = row.names(usair))   # For assessing actual influential observations

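[Figure: boxplot of cooks.distance(m1) with Phoenix, Buffalo and Cincinnati among the labelled points]

    llist <- llist[c(1, 5)]; llist
    [1] "Phoenix" ""
    predict(m1)[llist]
    usair[llist, ]

[Output: the prediction for Phoenix and the data rows for the two selected cities, with columns SO2, Neg.Temp, Manuf, Pop, Wind, Precip, Days and f.size; Phoenix is Medium and the second city is Small; values missing]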
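The influence check above can be illustrated on simulated data. This is a sketch, assuming a hypothetical data set in which one observation has both an extreme predictor value and a response far from the trend (the names `city1` to `city30` are illustrative); it ranks Cook's distances and flags the ones a boxplot would mark as outliers, using base R only.

```r
set.seed(7)
n <- 30

# Simulated data (illustrative, not usair): the last point has an extreme
# x value (high leverage) and a response well below the trend
x <- c(runif(n - 1, 0, 10), 30)
y <- 5 + 2 * x + rnorm(n)
y[n] <- y[n] - 25
d <- data.frame(x = x, y = y, row.names = paste0("city", seq_len(n)))

fit <- lm(y ~ x, data = d)
cd  <- cooks.distance(fit)             # one distance per observation

# Largest distances first
top <- head(sort(cd, decreasing = TRUE), 3)

# Distances that a boxplot would mark as outliers (1.5 * IQR rule)
flagged <- names(cd)[cd %in% boxplot.stats(cd)$out]

print(top)
print(flagged)
```

Cook's distance combines residual size and leverage, so a point needs both a poor fit and an unusual position in the predictor space to rank highest, as the planted observation does here.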