Simple Linear Regression
  Technical: linear regression; sd(errors); confidence intervals; prediction intervals; MINITAB
  Conceptual: Normal dist; response y; parameters; non-linear; non-normal; predictors x; scales (inc logs)
  A what-if prob model for prediction
  A screening process
  A scientific tool for testing/thinking
  A simple model of the data generating system
Cert Stats: Intro to Regression Week 1 1
Simple Lin Reg: $Y = \alpha + \beta x + \varepsilon$
  Revision
  Examples
  Extensions/Variations
  Scales: Additive / Multiplicative
  Multiple regression
  Concepts in Statistical Modelling
  Statistical Inference
Why model?
  The purpose of a statistical model is not to fit the data; it is rather to refine the question.
Statistical Models
  (Some aspect of) Y is simply related to (some aspect of) X
  "(Some aspect of)": in reported units; some other units; more general
  "Simply related to": small change in X -> (same?) small change in Y
    Change: additive; multiplicative; other
  Apart from some (small) (unpredictable) (error)
    which is additive; multiplicative; other
    and of (same) (avg size) for all values of X
  Why model?
Using models
  Conceptual
    A summary of the data to communicate to others
    A what-if prob model for prediction
    A screening process
    A scientific tool for testing/thinking
    A simple model of the data generating system
  Summary / Descriptive
  Comparison
  Insight into process
Scales and Change
  Time
  Salary, price, volume
  Quality
  Temperature
  Earthquake magnitude, pH
  Approval/Rating
  Likelihood/proportion
  Before/after; with/without
  Treatments/Admin unit: 1,2,3,4 or A,B,C,D
  Fuel consumption: Miles/gallon; Miles/litre; Litres/100 km !!!
  "Keep global warming to 2C, equiv 35.6F" !!! (a change of 2 C is a change of 3.6 F; 35.6 F is the temperature 2 C)
House Completions vs Time
  [Fitted Line Plot: Completions vs time]
  Completions = -1527868 + 769.4 time
  S = 926.658; R-Sq = 87.9%; R-Sq(adj) = 87.6%
  House completions data series, based on the number of new dwellings connected by ESB Networks to the electricity supply
  Source: Department of the Environment, Heritage and Local Government
  Objective: To summarise
  Question?
Diamonds: Carat wt vs Price
  [Fitted Line Plot: Price Sing $ vs Carat, with regression line, 95% CI and 95% PI]
  Price Sing $ = -2298 + 11599 Carat
  S = 1117.56; R-Sq = 89.3%; R-Sq(adj) = 89.2%
  Data: Carat = weight of stones in carat units; Price (Singapore $)
  Objective
    Nominal: Predict Price from Wt
    Actual: Is relationship linear?
  Question?
  Prediction bands ≈ ±2s
  Source: Singfat Chu (National University of Singapore), "Pricing the C's of Diamond Stones", Journal of Statistics Education, Volume 9, Number 2 (2001)
  See http://www.amstat.org/publications/jse/jse_data_archive.htm
      http://www.amstat.org/publications/jse/v9n2/datasets.chu.html
Trees: Vol
  [Fitted Line Plot: Volume vs Height, with regression line, 95% CI and 95% PI]
  Volume = -87.12 + 1.543 Height
  S = 13.397; R-Sq = 35.8%; R-Sq(adj) = 33.6%
  Sample of 31 black cherry trees in the Allegheny National Forest, Pennsylvania
    Y = volume (cubic feet)
    X1 = height (feet)
    X2 = diameter (inches), at 54 inches above ground
  Objective: Use height as proxy for volume
    Later: use diam and ht combined measure as proxy for volume
  Objective: Prediction via Calibrated Model
  Question?
  Prediction bands ≈ ±2s
CEO Compensation (US$) and Company Sales (US$m) (Forbes Magazine, May 1994)
  [Fitted Line Plot: Compensation vs Sales, with regression line and 95% PI]
  Compensation = 1437416 + 61.57 Sales
  S = 1367165; R-Sq = 13.7%; R-Sq(adj) = 13.5%
  Data listing: Total comp, Industry, Sales (industries include Financial, ComputersComm, Insurance, Entertainment)
  Question?
  M.Stuart
Gas Consumption vs Temp
  Weekly gas consumption (in 1000 cubic feet) and average outside temperature (in deg C) for two "heating seasons": 26 weeks before / 30 weeks after cavity-wall insulation was installed. Thermostat was set at 20 C throughout.
  Period 1 [Fitted Line Plot: Gas vs Temperature]
    Gas = 6.854 - 0.3932 Temperature
    S = 0.281334; R-Sq = 94.4%; R-Sq(adj) = 94.1%
  Period 2 [Fitted Line Plot: Gas vs Temperature]
    Gas = 4.724 - 0.2779 Temperature
    S = 0.354848; R-Sq = 81.3%; R-Sq(adj) = 80.6%
  Objective: Test: Measurable effect of insulation?
  Compare Intercepts & Slopes Before/After
  Question?
Math Marks
  88 students in each of 5 maths exams
  [Fitted Line Plot: Stat vs Alg, with regression line, 95% CI and 95% PI]
    Stat = -12.32 + 1.080 Alg
    S = 12.966; R-Sq = 44.2%; R-Sq(adj) = 43.5%
  [Fitted Line Plot: Stat vs Anal, with regression line, 95% CI and 95% PI]
    Stat = 9.361 + 0.7058 Anal
    S = 13.792; R-Sq = 36.9%; R-Sq(adj) = 36.1%
  Objective
    Nominal: Predict one mark from another
    Actual: Understand inter-relationships
  Question?
  Prediction bands ≈ ±2s
  Source: Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate analysis. London: Academic Press.
Galton's heights data
  N=178 pairs; Y = Offspring (inches); X = Mid-parent (inches)
  [Scatterplot: Galton's Data, with fitted line and a line of slope 1]
  Slope = 0.514
  Corr = 0.51
  Objective
    Nominal: Predict off-spring ht
    Actual: Quantify heredity
  Question?
  Reversion to the mediocre
  Regression to the mean
Revision
  Plotting
  Scales
  Logs
  Statistical Inference
    Normal distribution
    Prediction and Confidence Intervals
    Statistical significance
Plotting Revision
  [Fitted Line Plot: Completions vs time]
  Completions = -1527868 + 769.4 time
  S = 926.658; R-Sq = 87.9%; R-Sq(adj) = 87.6%
  Over-plot fitted line
  Visually estimate residuals
  95% prediction interval ≈ line ± 2s (parallel lines)
Straight Line: changing the scale
  [Fitted Line Plot: Gas vs Temperature, with a second axis for Temp in F]
  Gas = 6.854 - 0.3932 Temperature
  S = 0.281334; R-Sq = 94.4%; R-Sq(adj) = 94.1%
  Change Temp to Fahrenheit
    (C, F) = (0, 32) Freezing
    (C, F) = (100, 212) Boiling
  Exercise: change the Gas scale (1000 ft³ = 28317 L)
Straight Line: change the equation
  Gas = 6.85 - 0.393 Temp C
  Give equations for use with Fahrenheit X axis
  Slope
    +1 C: reduction of 0.393 in Gas
    +1.8 F: reduction of 0.393 in Gas
    +1 F: reduction of 0.393/1.8 = 0.218 in Gas
  Intercept
    0 C: Gas consumption = 6.85
    32 F: Gas cons = 6.85
    0 F: Gas cons = 6.85 - 0.218(-32) = 13.82
  Gas = 13.82 - 0.218 Temp F
  $$y = 6.85 - 0.393\,\mathrm{Temp}_C = 6.85 - 0.393\,(\mathrm{Temp}_F - 32)/1.8 = 6.85 + 0.393 \times 32/1.8 - (0.393/1.8)\,\mathrm{Temp}_F = 13.82 - 0.218\,\mathrm{Temp}_F$$
  See EXCEL Change.Scale
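The change of scale can be checked numerically. The following is a minimal sketch, assuming the Period 1 coefficients printed on the fitted line plot (6.854 and -0.3932, temperature in deg C); the function name `celsius_fit_to_fahrenheit` is illustrative and not part of the course material.

```python
def celsius_fit_to_fahrenheit(intercept_c, slope_c):
    """Re-express y = a + b*TempC as y = a_f + b_f*TempF, using TempC = (TempF - 32) / 1.8."""
    slope_f = slope_c / 1.8                    # change in y per 1 degree F
    intercept_f = intercept_c - slope_f * 32   # predicted y at 0 degrees F
    return intercept_f, slope_f


# Period 1 coefficients as printed on the fitted line plot (Temp in deg C).
a_c, b_c = 6.854, -0.3932
a_f, b_f = celsius_fit_to_fahrenheit(a_c, b_c)

# The same physical temperature must give the same predicted Gas on either scale.
temp_c = 5.0
temp_f = 1.8 * temp_c + 32
pred_from_c = a_c + b_c * temp_c
pred_from_f = a_f + b_f * temp_f

print(f"Gas = {a_f:.2f} + ({b_f:.3f}) * TempF")
print(pred_from_c, pred_from_f)
```

The intercept computed this way differs slightly from the slide's 13.82, because the slide works from coefficients rounded to 6.85 and 0.218.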
Alt parameterisation: centred
  $y = a + bx$
  Centred: $y - \bar{y} = b(x - \bar{x})$, i.e. $y = \bar{y} - b\bar{x} + bx$
  Coeffs before centering: a = 6.85, b = -0.393
  Means: Gas 4.75, Temp 5.35
  X = ?
  [Fitted Line Plot: Gas vs Temperature]
    Gas = 6.854 - 0.3932 Temperature
    S = 0.281334; R-Sq = 94.4%; R-Sq(adj) = 94.1%
  [Fitted Line Plot: CenGas vs CenTemp]
    CenGas = 0.000 - 0.3932 CenTemp
    S = 0.281334; R-Sq = 94.4%; R-Sq(adj) = 94.1%
  Temp   Gas   CenTemp   CenGas
  -0.8   7.2    -6.15     2.45
  -0.7   6.9    -6.05     2.15
   0.4   6.4    -4.95     1.65
   2.5   6.0    -2.85     1.25
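A small numerical check of the centred parameterisation: the slope is unchanged and the intercept becomes zero. This sketch uses only the four (Temp, Gas) rows listed on the slide, centred at their own means, so its coefficients will not match the full Period 1 fit; `fit_line` is an illustrative helper.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Only the four (Temp, Gas) rows listed on the slide; NOT the full Period 1 data.
temp = [-0.8, -0.7, 0.4, 2.5]
gas = [7.2, 6.9, 6.4, 6.0]

temp_mean = sum(temp) / len(temp)
gas_mean = sum(gas) / len(gas)
cen_temp = [t - temp_mean for t in temp]
cen_gas = [g - gas_mean for g in gas]

a, b = fit_line(temp, gas)
a_cen, b_cen = fit_line(cen_temp, cen_gas)
print(f"raw:     intercept {a:.3f}, slope {b:.3f}")
print(f"centred: intercept {a_cen:.3f}, slope {b_cen:.3f}")
```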
Alt parameterisation: standardised
  $y = a + bx$
  Standardised: $\dfrac{y - \bar{y}}{s_y} = \left(b\,\dfrac{s_x}{s_y}\right)\dfrac{x - \bar{x}}{s_x}$
  Now: intercept 0; slope $b\,s_x/s_y$
  Coeffs before standardising: a = 6.85, b = -0.393
  Means: Gas 4.75, Temp 5.35
  SDs: Temp 2.87, Gas 1.16
  [Fitted Line Plot: Gas vs Temperature]
    Gas = 6.854 - 0.3932 Temperature
    S = 0.281334; R-Sq = 94.4%; R-Sq(adj) = 94.1%
  [Fitted Line Plot: StGas vs StTemp]
    StGas = -0.000 - 0.9715 StTemp
    S = 0.241936; R-Sq = 94.4%; R-Sq(adj) = 94.1%
  Temp   Gas   StTemp   StGas
  -0.8   7.2   -2.14     2.11
  -0.7   6.9   -2.11     1.85
   0.4   6.4   -1.72     1.42
  Scale free
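The standardised slope equals b times s_x/s_y, which is also the Pearson correlation. The sketch below illustrates this on the first four (Temp, Gas) rows of the gas data only, so the numbers are for illustration and differ from the full fit; all helper names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def slope_and_intercept(x, y):
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return b, y_bar - b * x_bar


def standardise(values):
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


# Toy subset of the gas data (four rows), used for illustration only.
temp = [-0.8, -0.7, 0.4, 2.5]
gas = [7.2, 6.9, 6.4, 6.0]

b, a = slope_and_intercept(temp, gas)
b_std, a_std = slope_and_intercept(standardise(temp), standardise(gas))

# Pearson correlation computed directly, for comparison with the standardised slope.
t_bar, g_bar = mean(temp), mean(gas)
r = sum((t - t_bar) * (g - g_bar) for t, g in zip(temp, gas)) / (
    (len(temp) - 1) * sample_sd(temp) * sample_sd(gas)
)
print(b_std, b * sample_sd(temp) / sample_sd(gas), r)
```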
Revision: scale
  Descriptive Statistics: Stat, Anal, Alg, Vect, Mech
    Variable   Mean    SE Mean   StDev
    Stat       42.31   1.84      17.26
    Anal       46.68   1.58      14.85
    Alg        50.60   1.13      10.62
    Vect       50.59   1.40      13.15
    Mech       38.95   1.86      17.49
  [Matrix Plot of Stat, Anal, Alg, Vect, Mech]
  Correlations: Stat, Anal, Alg, Vect, Mech
            Stat    Anal    Alg     Vect
    Anal    0.607
    Alg     0.665   0.711
    Vect    0.436   0.485   0.610
    Mech    0.389   0.409   0.547   0.553
    Cell Contents: Pearson correlation
  Derive Equations
    Stats vs Algebra: $\dfrac{\mathrm{Stat} - 42.31}{17.26} = 0.665 \times \dfrac{\mathrm{Alg} - 50.60}{10.62}$
    Stats vs Analysis
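The "derive equations" exercise can be sketched as follows: the slope is r times s_y/s_x and the intercept is the mean of y minus the slope times the mean of x. The summary values are those read from the tables on this slide (Alg mean 50.60 and SD 10.62, correlation 0.665 with Stat, and so on); `line_from_summary` is an illustrative name.

```python
def line_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Least-squares line of y on x from means, SDs and the correlation."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Summary values as read from the tables (Stat is the response in both fits).
stat_mean, stat_sd = 42.31, 17.26
alg_a, alg_b = line_from_summary(50.60, 10.62, stat_mean, stat_sd, r=0.665)
anal_a, anal_b = line_from_summary(46.68, 14.85, stat_mean, stat_sd, r=0.607)

print(f"Stat = {alg_a:.2f} + {alg_b:.3f} Alg")
print(f"Stat = {anal_a:.2f} + {anal_b:.3f} Anal")
```

The results should be close to the fitted line plots for the Math Marks data; small differences come from the rounding of the summary statistics.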
Scale: Additive or Multiplicative
  Y depends on x; some random variation involved
  Data: $(y_1, x_1)$, $(y_2, x_2)$
  Additive or Multiplicative changes?
    Additive: $(y_2 - y_1)$ depends on $(x_2 - x_1)$
    Multiplicative: $(y_2 / y_1)$ depends on $(x_2 / x_1)$
  Multiplicative change, eg: Doubles; 1% increase
  Log Transform renders Mult as Additive
    $(\log(y_2) - \log(y_1))$ depends on $(\log(x_2) - \log(x_1))$
Diamonds
  [Fitted Line Plot: Price Sing $ vs Carat, with regression line, 95% CI and 95% PI]
    Price Sing $ = -2298 + 11599 Carat
    S = 1117.56; R-Sq = 89.3%; R-Sq(adj) = 89.2%
    Prediction bands ≈ ±2s
  [Fitted Line Plot: log fit drawn against Carat, with regression line, 95% CI and 95% PI]
    log10(price Sing $) = 3.964 + 1.537 log10(carat)
    S = 0.0731222; R-Sq = 95.7%; R-Sq(adj) = 95.7%
    Prediction bands?
  Minitab? Model details? Compare? Improve?
Trees: Vol vs Height
  [Fitted Line Plot: Log10Vol vs Log10Ht, with regression line, 95% CI and 95% PI]
    Log10Vol = -6.062 + 3.982 Log10Ht
  [Fitted Line Plot: Volume vs Height, log fit drawn on the original scales, with regression line, 95% CI and 95% PI]
    log10(volume) = -6.062 + 3.982 log10(height)
  S = 0.176926; R-Sq = 42.1%; R-Sq(adj) = 40.1%
  Minitab? Model details? Compare? Improve?
Multiplicative Change
  Salary increases by 20%
    Salary -> Salary + 20% of Salary = Salary + Salary × 0.2 = Salary × 1.2
  Salary decreases by 20%
    Salary -> Salary × 0.8
  Salary increases by 20% then decreases by 20%
    Salary -> (Salary × 1.2) × 0.8 = Salary?
Recall: Logs and Percentages
  $\log 1 = 0$; $\log 10 = 1$; $\log 5 = 0.69897$
  $\log 50 = \log(5 \times 10) = \log 5 + \log 10 = 1.69897$
  $\log 100 = \log(10 \times 10) = \log 10 + \log 10 = 2$
  $\mathrm{Antilog}(2) = 10^{2} = 100$
  $\mathrm{Antilog}(1.69897) = 10^{1.69897} = 10^{1} \times 10^{0.69897} = 10 \times 5 = 50$
  Salary increases by 20%
    Salary -> Salary × 1.2
    log Salary -> log Salary + log 1.2 = log Salary + 0.079
  Salary decreases by 20%
    Salary -> Salary × 0.8
    log Salary -> log Salary + log 0.8 = log Salary - 0.097
  Log Salary increases by 0.079 then decreases by 0.079: Salary unchanged
    Up by 20%, then down by 16.6%: not symmetric
    × 1.2 then ÷ 1.2, ie × 1/1.2 = × 0.833
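The percentage and log arithmetic above can be reproduced in a few lines. This is a minimal sketch; the starting salary of 1000 is a hypothetical value chosen only to make the numbers concrete.

```python
import math

log_up = math.log10(1.2)     # shift in log10(Salary) for a 20% rise
log_down = math.log10(0.8)   # shift in log10(Salary) for a 20% cut

salary = 1000.0              # hypothetical starting salary
after_up_then_down = salary * 1.2 * 0.8   # 20% up, then 20% down

# Undoing a 20% rise means subtracting log_up, i.e. multiplying by 1/1.2.
undo_factor = 10 ** (-log_up)
undo_percent_cut = (1 - undo_factor) * 100

print(round(log_up, 3), round(log_down, 3))
print(after_up_then_down, round(undo_factor, 3), round(undo_percent_cut, 1))
```

A 20% rise followed by a 20% cut leaves 96% of the original salary; on the log scale the two shifts, +0.079 and -0.097, are visibly unequal.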
Changing to/from Log scale
  Equation in log scale: $\log(y) = a + b\log(x)$
  Equation in lin. scale:
    $y = \mathrm{Antilog}(\log y) = 10^{a + b\log(x)} = 10^{a} \times 10^{b\log(x)} = 10^{a} \times \left(10^{\log(x)}\right)^{b} = 10^{a}\,x^{b}$
    OR $y = \left(10^{a/b}\,x\right)^{b}$, ie $y = (\text{Rescaled } x)^{b}$
  Increase $\log(x)$ by 1 -> increase $\log(y)$ by $b$ (linearly)
  Multiply $x$ by 10 -> increase $y$ by a factor of $10^{b}$ (multiplicatively)
    $y$ changes to $y \times 10^{b}$ (linear change of $y\,(10^{b} - 1)$)
Trees: Changing to/from Log scale
  Equation in log scale: $\log(\mathrm{Volume}) = -6.062 + 3.982\,\log(\mathrm{Height})$
  Equation in lin. scale ($y = 10^{a} x^{b}$):
    $\mathrm{Volume} = 10^{-6.062} \times \mathrm{Height}^{3.982} = 0.000000867 \times \mathrm{Height}^{3.982}$
    OR $\mathrm{Volume} = \left(10^{-6.062/3.982} \times \mathrm{Height}\right)^{3.982} = (0.030\,\mathrm{Height})^{3.982}$
  $\mathrm{Volume} \approx (0.03\,\mathrm{Height})^{4}$
  Height doubles: Volume multiplies by $2^{3.982}$, about 16
  Refine the question?
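The back-transformation can be checked numerically. This sketch assumes the log10 coefficients are -6.062 and 3.982, as read from the fitted line plot; the function names and the heights used are illustrative only.

```python
import math

# Coefficients of the log10 fit, read from the slide as -6.062 and 3.982 (an assumption).
a, b = -6.062, 3.982


def volume_power_form(height):
    """Volume = 10**a * Height**b."""
    return 10 ** a * height ** b


def volume_rescaled_form(height):
    """Volume = (10**(a/b) * Height)**b, the 'rescaled x' version."""
    return (10 ** (a / b) * height) ** b


multiplier = 10 ** a            # about 8.67e-7
height_scale = 10 ** (a / b)    # about 0.030
doubling_factor = volume_power_form(160.0) / volume_power_form(80.0)  # equals 2**b

print(multiplier, height_scale, doubling_factor)
```

Both forms give the same predicted volume for any height, and doubling the height multiplies the predicted volume by 2**b whatever the starting height.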
CEO Compensation
  Which scale to use?
  What does this tell us about the nature of the relationship?
  $\mathrm{Compensation} = (\text{Rescaled Sales})^{1/4}$
  Sales double?
  Question?
Error Term in Log Scale
  Linear Scale Model
    $y = a + bx + \varepsilon$
    $y = \text{predicted } y \pm 2\sigma$
    Error model: $\varepsilon \sim N(0, \sigma^{2})$; 95% of random errors lie within $\pm 2\sigma$
    Notation sometimes used: $y = \text{predicted } y \pm 2s$
    s small: y close to pred y
    Error band width constant
  Log Scale Model
    $\log y = a + b\log x + \varepsilon$
    $\log y = \text{predicted } \log y \pm 2\sigma$
    Linear scale plot: $y = \text{predicted } y \times 10^{\pm 2\sigma}$
    Notation sometimes used: $y = \text{predicted } y \times 10^{\pm 2s}$
    s small: y close to pred y
    Error band proportional to y
    Multiplicative error
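A sketch of how a constant ±2s band on the log scale turns into a multiplicative, asymmetric band on the original scale. The coefficients are the log-scale diamond fit as printed; the value s = 0.07 is a hypothetical residual SD used only for illustration, and `band_on_original_scale` is an illustrative name.

```python
import math


def band_on_original_scale(x, a, b, s):
    """Approximate 95% band for y when log10(y) = a + b*log10(x) + error, error SD s."""
    predicted = 10 ** (a + b * math.log10(x))
    factor = 10 ** (2 * s)            # multiplicative half-width of the band
    return predicted / factor, predicted, predicted * factor


# Illustrative values only: log-scale diamond coefficients as printed, and a hypothetical s.
a, b, s = 3.964, 1.537, 0.07

for carat in (0.3, 0.6, 1.0):
    lower, predicted, upper = band_on_original_scale(carat, a, b, s)
    print(f"{carat:.1f} carat: {lower:8.0f} {predicted:8.0f} {upper:8.0f}")
```

The upper limit lies further from the prediction than the lower limit, and the band widens as the predicted value grows.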
Error Term in Log Scale
  Not symmetric
Revision Computing
  Comps   Time      RESI1      FITS1
  4296    1990.00    1116.75   3179.25
  4477    1990.25    1105.41   3371.59
  5011    1990.50    1447.07   3563.93
  4752    1990.75     995.72   3756.28
  4692    1991.00     743.38   3948.62
  3906    1991.25    -234.96   4140.96
  4895    1991.50     561.70   4333.30
  4979    1991.75     453.35   4525.65
  4155    1992.00    -562.99   4717.99
  5603    1992.25     692.67   4910.33
  5886    1992.50     783.33   5102.67
  5338    1992.75      42.98   5295.02
  3684    1993.00   -1803.36   5487.36
  4487    1993.25   -1192.70   5679.70
  5089    1993.50    -783.04   5872.04
  Etc...
  [Scatterplot of FITS1 vs time]
  [Histogram of RESI1 with fitted Normal curve: StDev 926.7, N 44]
Revision Normal Dist
  [Histogram of RESI1 with fitted Normal curve: StDev 926.7, N 44]
  Long run props
    68% within ± 1 SD of mean
    95% within ± 2 SD of mean
    99.7% within ± 3 SD of mean
  44 resids: Mean 0.0, SD 915.8
    15 of 44 outside (± 915.8)
    29 of 44 = 65.9% within
  Residuals (first 15 of 44): 1116.8, 1105.4, 1447.1, 995.7, 743.4, -235.0, 561.7, 453.4, -563.0, 692.7, 783.3, 43.0, -1803.4, -1192.7, -783.0, ...
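The long-run proportions can be computed from the Normal distribution and set beside the observed count. A minimal sketch using only the standard library; the 29 of 44 count is the one quoted on the slide, and `normal_within` is an illustrative name.

```python
import math


def normal_within(k):
    """P(|Z| < k) for a standard Normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))


long_run = {k: normal_within(k) for k in (1, 2, 3)}
observed_within_1sd = 29 / 44   # count quoted on the slide

for k, p in long_run.items():
    print(f"within {k} SD: {100 * p:.1f}%")
print(f"observed within 1 SD: {100 * observed_within_1sd:.1f}%")
```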
Revision Normal Dist
  Sketch Normal: Mean 25, SD 4
  68% within ± 1 SD of mean
  95% within ± 2 SD of mean
  99.7% within ± 3 SD of mean
Revision: SumSq
  [Fitted Line Plot: Stat vs Anal, with regression line and 95% PI]
  Stat = 9.361 + 0.7058 Anal
  S = 13.792; R-Sq = 36.9%; R-Sq(adj) = 36.1%
  Descriptive Statistics
    Variable   Count   Mean    StDev   Variance   Sum of Squares
    Stat       88      42.31   17.26   297.76     183413.00
    Anal       88      46.68   14.85   220.38     210942.00
    RES        88      -0.00   13.71   187.98      16354.67
    FITS       88      42.31   10.48   109.77     167058.33
    CenStat    88      -0.00   17.26   297.76      25904.72
    CenAnal    88      -0.00   14.85   220.38      19173.09
    CenFit     88      -0.00   10.48   109.77       9550.05
  Regression Analysis: Stat versus Anal
    The regression equation is Stat = 9.361 + 0.7058 Anal
    S = 13.792; R-Sq = 36.9%; R-Sq(adj) = 36.1%
  Analysis of Variance
    Source       DF   SS        MS        F       P
    Regression    1    9550.0   9550.05   50.22   0.000
    Error        86   16354.7    190.17
    Total        87   25904.7
  Stat   Anal   RESI1     FITS1
  81     67     24.3534   56.6466
  81     70     22.2362   58.7638
  81     66     25.0592   55.9408
  68     70      9.2362   58.7638
  ...
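The summary figures S, R-Sq, R-Sq(adj) and F all follow from the sums of squares in the ANOVA table. A minimal sketch, assuming the Total and Error sums of squares read as 25904.7 and 16354.7 with n = 88 students:

```python
import math

n = 88                 # students
ss_total = 25904.7     # Total SS, read from the ANOVA table (assumed reading)
ss_error = 16354.7     # Error SS

ss_regression = ss_total - ss_error
df_error = n - 2
df_total = n - 1

ms_error = ss_error / df_error
s = math.sqrt(ms_error)                          # residual standard deviation
r_sq = ss_regression / ss_total
r_sq_adj = 1 - ms_error / (ss_total / df_total)
f_stat = (ss_regression / 1) / ms_error          # regression has 1 degree of freedom

print(f"S = {s:.3f}, R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%, F = {f_stat:.2f}")
```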
Revision: Covar, Corr and R²
  Regression Analysis: Stat versus Alg
    Stat = -12.32 + 1.080 Alg
    S = 12.966; R-Sq = 44.2%
  Analysis of Variance
    Source       DF   SS        MS
    Regression    1   11446.6   11446.6
    Error        86   14458.1     168.1
    Total        87   25904.7
  Covariances (rounded)
             Stat   Alg   FITS1   RESI1
    Stat     298
    Alg      122    113
    FITS1    132    122   132
    RESI1    166      0     0     166
  Covariances
             Stat      Alg       FITS1     RESI1
    Stat     297.755
    Alg      121.871   112.886
    FITS1    131.570   121.871   131.570
    RESI1    166.185    -0.000    -0.000   166.185
  Correlations
             Stat    Alg      RESI1
    Alg      0.665
    RESI1    0.747   -0.000
    FITS1    0.665    1.000   -0.000
  Correlations (rounded)
             Stat   Alg    FITS1   RESI1
    Stat     1
    Alg      0.66   1
    FITS1    0.66   1.00   1
    RESI1    0.75   0.00   0.00    1
Revision: Corr, R², S
  Descriptive Statistics: Stat, Anal, Alg, Vect, Mech
    Variable   Mean    SE Mean   StDev
    Stat       42.31   1.84      17.26
    Anal       46.68   1.58      14.85
    Alg        50.60   1.13      10.62
    Vect       50.59   1.40      13.15
    Mech       38.95   1.86      17.49
  [Matrix Plot of Stat, Anal, Alg, Vect, Mech]
  Correlations: Stat, Anal, Alg, Vect, Mech
            Stat    Anal    Alg     Vect
    Anal    0.607
    Alg     0.665   0.711
    Vect    0.436   0.485   0.610
    Mech    0.389   0.409   0.547   0.553
    Cell Contents: Pearson correlation
  Revision: Stats vs Anal
    Derive R², S
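One way to approach the "derive R², S" exercise: R² is the square of the correlation, and S follows from the SD of the response because the error sum of squares is (1 - r²) times the total sum of squares. A sketch assuming r = 0.607 and StDev(Stat) = 17.26 as read from the tables, with n = 88; `r_sq_and_s` is an illustrative name.

```python
import math


def r_sq_and_s(r, sd_y, n):
    """R-squared and residual SD of a simple regression, from r, the SD of y and sample size."""
    r_sq = r ** 2
    # SSE = (1 - r^2) * (n - 1) * sd_y^2, and S^2 = SSE / (n - 2).
    s = sd_y * math.sqrt((1 - r_sq) * (n - 1) / (n - 2))
    return r_sq, s


# Stat vs Anal: correlation and SD of Stat as read from the tables, n = 88 students.
r_sq, s = r_sq_and_s(r=0.607, sd_y=17.26, n=88)
print(f"R-Sq = {100 * r_sq:.1f}%, S = {s:.2f}")
```

The results agree with the Minitab output for Stat versus Anal up to the rounding of r and the SD.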
Revision: Statistical Inference
  Point Estimates
    Best single values for intcpt/slope
    Conceptually
      Propose many values
      Compute SumSq of implied residuals
      Choose values with min SSQ
  Margin for Error
    Confidence Intervals
    Prediction intervals
  Statistical Testing
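The conceptual recipe for point estimates (propose many values, compute the sum of squared residuals, keep the minimum) can be sketched as a brute-force search and compared with the closed-form least-squares answer. The data are four (Temp, Gas) rows from the gas example, used as a toy data set; the grid ranges and step size are arbitrary choices for illustration.

```python
def sum_sq_residuals(a, b, x, y):
    """Sum of squared residuals implied by the candidate line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Four (Temp, Gas) rows from the gas example; a toy data set for illustration.
x = [-0.8, -0.7, 0.4, 2.5]
y = [7.2, 6.9, 6.4, 6.0]

# Propose many values: a hypothetical grid of intercepts and slopes in steps of 0.01.
candidates = [(6.0 + 0.01 * i, -1.0 + 0.01 * j) for i in range(151) for j in range(101)]
a_grid, b_grid = min(candidates, key=lambda ab: sum_sq_residuals(ab[0], ab[1], x, y))

# Closed-form least-squares answer for comparison.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b_ls = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a_ls = y_bar - b_ls * x_bar

print(a_grid, b_grid)
print(a_ls, b_ls)
```

The grid search lands on the grid point nearest the least-squares solution; the formula gives the exact minimiser without any search.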
Are precision details important?
  Precision about what?
    Coefficients: Gas = 5.49 - 0.29 Temp (-0.29 ± ?)
  Need to test null hypotheses?
    Is there evidence against "insulation does not change Gas/Temp relationship"?
Concept: Statistical Model
  "Data like this"
  Replication
  Data Generating System
  Comp Generated Random Numbers
  $y_i \sim N(\mu_i, \sigma^{2})$
Concept: Simple linear regression model
  $y_i \sim N(\mu_i, \sigma^{2})$
  Data like this
  [Plot: Line; Data "like completions"; High; Low; time axis 1988 to 2002]
  Pinch of salt
Concept: Sampling Distribution
  What results with data like this?
  Monte Carlo Experiments
  Sampling Distribution
    St Dev (Samp Dist) = St Error
  Alt: Formulae methods, used by MINITAB
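A Monte Carlo experiment of this kind can be sketched in a few lines: generate many data sets from an assumed data generating system, refit the slope each time, and compare the standard deviation of the fitted slopes with the usual formula sigma / sqrt(Sxx). All parameter values below (alpha, beta, sigma, the x grid, the seed and the number of replicates) are hypothetical choices, loosely in the range of the gas example.

```python
import math
import random
import statistics


def fit_slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data generating system: y_i ~ N(alpha + beta * x_i, sigma^2).
alpha, beta, sigma = 6.85, -0.39, 0.28
x = [0.4 * i for i in range(26)]          # 26 fixed x values from 0 to 10

random.seed(1)                            # fixed seed so the experiment is repeatable
slopes = []
for _ in range(2000):                     # 2000 replicate data sets "like this"
    y = [random.gauss(alpha + beta * xi, sigma) for xi in x]
    slopes.append(fit_slope(x, y))

mc_mean = statistics.mean(slopes)
mc_se = statistics.stdev(slopes)          # SD of the sampling distribution = standard error

x_bar = sum(x) / len(x)
formula_se = sigma / math.sqrt(sum((xi - x_bar) ** 2 for xi in x))

print(mc_mean, mc_se, formula_se)
```

The simulated standard error should sit close to the formula value, which is the kind of formula-based result a package such as MINITAB reports.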
Concept: Statistical Inference
  Choose the best fitting model
  Treat it with scepticism
    How much?
      Rely on formulae based on the Normal dist
      Use random computer replication, eg modern bootstrap methods
  M.Stuart
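As an illustration of "random computer replication", here is a minimal sketch of one common bootstrap variant: resample the (x, y) pairs with replacement, refit the slope each time, and read off a percentile interval. The data are synthetic and every numerical choice (seed, coefficients, number of resamples) is hypothetical; this is not presented as the method used in the course.

```python
import random


def fit_slope(pairs):
    """Least-squares slope from a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    return sxy / sxx


random.seed(2)
# Synthetic (x, y) pairs standing in for real data; all values here are hypothetical.
data = [(0.4 * i, random.gauss(6.85 - 0.39 * 0.4 * i, 0.28)) for i in range(26)]
slope_hat = fit_slope(data)

boot_slopes = []
while len(boot_slopes) < 2000:
    resample = random.choices(data, k=len(data))     # draw pairs with replacement
    if len({x for x, _ in resample}) < 2:            # skip a resample with a single x value
        continue
    boot_slopes.append(fit_slope(resample))

boot_slopes.sort()
lower = boot_slopes[int(0.025 * len(boot_slopes))]       # 2.5th percentile
upper = boot_slopes[int(0.975 * len(boot_slopes)) - 1]   # 97.5th percentile
print(slope_hat, lower, upper)
```

The width of the percentile interval plays the same role as the formula-based margin for error, but it is obtained by replication rather than from the Normal distribution.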
Science and Statistical Inference
  Assumptions:
    Well stated and focussed science
    Adequate and relevant data
    Adequate model for data like this
    Independent sources of info
    Random variation
    Normal distribution
More than one predictor? Multiple Regression
  Statistics Marks: Four other Math marks
  Tree Volume: Tree Height and Diam at Chest Ht
  Diamond Price: Carat wt and Aspects of Quality
  Gas: Temp and Insulation Status
  Housing Comps: Seasonality