CHAPTER 1 A CASE STUDY FOR REGRESSION ANALYSIS

Similar documents
Donora, PA, (Source: New York Times, November )

Multi-site Time Series Analysis. Motivation and Methodology

New Problems for an Old Design: Time-Series Analyses of Air Pollution and Health. September 18, 2002

Exposure, Epidemiology, and Risk Program, Environmental Health Department, Harvard School of Public Health, Boston, Massachusetts, USA

The Clean Air Act of 1970 and Adult Mortality

Traffic related air pollution and acute hospital admission for respiratory diseases in Drammen, Norway

T he case-crossover design, introduced by in

Particulate Air Pollution and Mortality in 20 U.S. Cities:

Gasoline Consumption Analysis

Multi-site Time Series Analysis

Generalized Models to Analyze the Relationship between Daily

STAT 2300: Unit 1 Learning Objectives Spring 2019

Air Pollution and Daily Mortality in a City with Low Levels of Pollution

The Clean Air Act of 1970 and Adult Mortality

Bivariate Data Notes

A Statistical Comparison Of Accelerated Concrete Testing Methods

Assessing Seasonal Confounding and Model Selection Bias in Air Pollution Epidemiology Using Positive and Negative Control Analyses NRCSE

DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS

Practical Regression: Fixed Effects Models

Estimating the Health Effects of Ambient Air Pollution and Temperature

Acknowledgments. Data Mining. Examples. Overview. Colleagues

TECHNICAL REPORT NO. 8

Agriculture and Climate Change Revisited

MEAN ANNUAL EXPOSURE OF CHILDREN AGED 0-4 YEARS TO ATMOSPHERIC PARTICULATE POLLUTION

Mentor: William F. Hunt, Jr. Adjunct Professor. In this project our group set out to determine what effect, if any, an air monitor s distance

Chapter 2 Tools of Positive Analysis

Seasonal Analyses of Air Pollution and Mortality in 100 US Cities

Seasonal Analyses of Air Pollution and Mortality in 100 US Cities

Chapter 5 Regression

10.2 Correlation. Plotting paired data points leads to a scatterplot. Each data pair becomes one dot in the scatterplot.

Quadratic Regressions Group Acitivity 2 Business Project Week #4

ENVIRONMENTAL PROTECTION AGENCY APTI 435: Atmospheric Sampling Course. Student Manual

Particulate Matter Air Pollution and Health Risks ( edited for AHS APES)

Air Pollution and Premature Mortality in Rapidly Growing Economies: Evidence from Santiago, Chile

Seasonal Analyses of Air Pollution and Mortality in 100 U.S. Cities

Characterizing the long-term PM mortality response function: Comparing the strengths and weaknesses of research synthesis approaches

Lab 1: A review of linear models

PMF modeling for WRAP COHA

Comments On: Proposed Amendments: N.J.A.C. 7: and 7:27A-3.2 and Proposed New Rules: N.J.A.C. 7: and N.J.A.C.

Interrupted time series

Learn What s New. Statistical Software

Untangling Correlated Predictors with Principle Components

Air Quality, Infant Mortality, and the Clean Air Act of 1970*

Creative Commons Attribution-NonCommercial-Share Alike License

Layoffs and Lemons over the Business Cycle

Econometric Analysis Dr. Sobel

USEPA Region 7 Regional Technical Assistance Group

Forecasting for Short-Lived Products

Multilayer perceptron and regression modelling to forecast hourly nitrogen dioxide concentrations

Climate Change, Crop Yields, and Implications for Food Supply in Africa

COMPARISON OF PM LEVELS OF TWO PUBLIC HIGH SCHOOLS LOCATED IN DIFFERENT SOCIAL-ECONOMIC LOCATIONS

Statistics in Risk Assessment

Quantifying the long-term effects of air pollution on health in England using space time data. Duncan Lee Geomed th September 2015

Relative risk calculations

BETTER HEALTH FROM A BETTER ENVIRONMENT: A CROSS-COUNTRY EMPIRICAL ANALYSIS. By Hsiang-Chih Hwang. July Abstract

Correlation between Air Pollution and associated Respiratory Morbidity in Delhi

May 30, Harvard ACE Project 2 May 30, / 34

Modelling buyer behaviour/3 How survey results can mislead

Epidemiologic study design for investigating respiratory health effects of complex air pollution mixtures.

Science of the Total Environment

Categorical Predictors, Building Regression Models

AP Statistics Part 1 Review Test 2

Business Quantitative Analysis [QU1] Examination Blueprint

H R Anderson, S A Bremner, R W Atkinson, R M Harrison, S Walters

Methodology to Assess Air Pollution Impact on Human Health Using the Generalized Linear Model with Poisson Regression

Political economy of pollution in China

STAT 350 (Spring 2016) Homework 12 Online 1

Health impacts of air pollution in London: from understanding to forecasting

Semester 2, 2015/2016

A dverse short term effects of specific air pollutants on

Canadian Environmental Sustainability Indicators. Air health trends

Effect of E. coli mitigation on the proportion of time primary contact minimum acceptable state concentrations are exceeded: Technical note.

Preliminary Analysis of High Resolution Domestic Load Data

Design of Experiments

Correlation and Simple. Linear Regression. Scenario. Defining Correlation

How to Lie with Statistics Darrell Huff

Time-Series Studies of Particulate Matter

Air Pollution and Total Mortality In California

Air Pollution and Total Mortality In California

Multiple Choice (#1-9). Circle the letter corresponding to the best answer.

Health impact assessment of exposure to inhalable particles in Lisbon Metropolitan Area

TIME-SERIES STUDIES OF PARTICULATE MATTER

Price transmission along the food supply chain

Learning Objectives. Module 7: Data Analysis

Lecture 4 Air Pollution: Particulates METR113/ENVS113 SPRING 2011 MARCH 15, 2011

FUNDAMENTALS OF QUALITY CONTROL AND IMPROVEMENT. Fourth Edition. AMITAVA MITRA Auburn University College of Business Auburn, Alabama.

Beachville, Oxford County Particulate Matter Sampling. Contact information:

Real Estate Modelling and Forecasting

Oregon. 800 NE Oregon St. #640 Portland, OR (971)

Key findings of air pollution health effects and its application

Revisions to National Ambient Air Quality Standards for Particle Pollution. Webinar for States and Local Agencies December 19, 2012

Effect of PM2.5 on AQI in Taiwan

Thus, there are two points to keep in mind when analyzing risk:

GCTA/GREML. Rebecca Johnson. March 30th, 2017

MEASURE DESCRIPTION. Temperatures: Daily, Weekly, Monthly, Seasonally Diurnal Temperature Range: Daily, Weekly Unit:

Outliers identification and handling: an advanced econometric approach for practical data applications

Managerial Economics

Which is the best way to measure job performance: Self-perceptions or official supervisor evaluations?

Air Quality and GHG Emission Impacts of Stationary Fuel Cell Systems

Is There an Environmental Kuznets Curve: Empirical Evidence in a Cross-section Country Data

Transcription:

CHAPTER 1 AIR POLLUTION AND PUBLIC HEALTH: A CASE STUDY FOR REGRESSION ANALYSIS 1

2

Fundamental Equations Simple linear regression: y i = α + βx i + ɛ i, i = 1,, n Multiple regression: y i = p j=1 x ij β j + ɛ i, i = 1,, n The ɛ i are random errors (typically mean 0, common variance σ 2 ) 3

QUESTIONS Which variable should be y and which should be x? Should either or both variables be transformed (for example, by taking logarithms) prior to analysis? Is the relationship between y and x linear, or should some nonlinear function be considered? Are there omitted variables whose inclusion might substantially change our conclusions about the form of the relationship? Are there outliers or influential values among either the x or y variables which may be distorting the interpretation of the data? Are the random errors truly independent, of equal variance, and normally distributed? Finally and perhaps most problematically of all having found a relationship between the two variables and tested its statistical significance, can this be taken as evidence of a causal effect? 4

Background on Air Pollution and Health It is widely recognized that high air pollution has a significant public health impact However there are many issues that are still debated in the context of present-day air pollution regulations: Which precise pollutants are responsible, eg ozone, sulfur dioxide, fine particulate matter, coarse particulate matter, etc? Is there a threshold effect, ie a safe level below which air pollution essentially has no adverse consequences? Which subsets of the population are most affected, and in what ways (what kinds of deaths or other adverse reactions such as asthma)? Exactly how should one quantify the overall effect? 5

Early Studies London in the 1950s Very high air pollution levels (20 30 times current EPA standards?) and sharp rises in deaths (eg the December 1952 smog in London is generally reckoned to have caused 4,000 excess deaths in a 4 5 day period) Graphs of daily deaths versus air pollution levels give some nice simple examples of linear regression relationships Research on what really happened continues up to this day 6

7

8

Plots of data show: Strong visible association between smoke levels (circles) and deaths (black dots) suggestion of a one-day lag (Fig 12) A scatterplot shows a more direct relationship with high correlation (Fig 13) However, more detailed plots from a different smog event show a number of possible complicating factors, eg temperatures were also very low on the days when deaths were high We can also see some distinction among different age groups (Fig 14) Individual scatterplots give more information, but just looking at correlations could be misleading for determining the true causal relationship (Fig 15) Time series over several London winters show gradually decreasing deaths as smoke levels sharply decreased (Fig 16) 9

We can also look at scatterplots of the annual aggregated deaths (Fig 17) Note the effect of the outlier due to the very cold 1963 winter

10

11

12

13

14

15

Modern Studies of Air Pollution and Health Modern studies apply many of the same methods of analysis to much larger data sets For example, Kelsall et al analyzed a 14- year series of daily mortality and air pollution in Philadelphia They decomposed the observed daily time series of deaths into three components, (a) seasonal and long-term trend, (b) meteorology, (c) air pollution They concluded that even though much of the variability can be attributed to (a) and (b), there is a strong enough residual effect under (c) to conclude that air pollution has a very significant effect, even at levels of air pollution much lower than the infamous London smogs JE Kelsall, JM Samet, SL Zeger and J Xu, American Journal of Epidemiology 146, 750 762 16

The present study is based on a re-analysis of the same data and is intended to illuminate many of the issues, including some that remain controversial Figure 18 Time series plot of weekly deaths in Philadelphia, with smoothed lowess curve The vertical dotted lines are placed to indicate the ends of years Figure 19 Scatterplots of daily deaths against four covariates, with fitted subsample averages (the text will describe the details of how these were calculated) TSP is Total Suspended Particulates (this measure has now been replaced in most studies by more specific measures of particles in different size ranges) SO2 is sulfur dioxide 17

1974-1981 350 Deaths 300 250 200 0 100 200 300 400 Week 1981-1988 350 Deaths 300 250 200 400 500 600 700 Week 18

(a) Temperature Deaths 0 20 40 60 80 20 30 40 50 60 (b) Dewpoint Deaths -20 0 20 40 60 80 20 30 40 50 60 (c) TSP Deaths 50 100 150 200 20 30 40 50 60 (d) SO2 Deaths 0 20 40 60 80 100 20 30 40 50 60 19

Outline of analysis (more details in Chapter 5): The y variable was taken to be the square root of daily death count The square root transformation is motivated in part by the fact that if the death counts have a Poisson distribution, which seems a reasonable intuitive assumption, then taking square roots stabilizes the variances See Section 53 for a more detailed discussion of transformations Seasonal trend was handled by representing the smooth curve in Fig 18 as a linear combination of 180 fixed basis functions Since we believe that the acute effects of interest occur within at most one week of a high-pollution event, these seasonal covariates should not confound the pollution effect 20

Meteorological effects were handled by introducing linear and quadratic terms for both temperature and dewpoint, with additional terms for the range above 75 o F and 60 o F respectively Moreover, to allow for delayed effects of up to four days, the current day s values were supplemented by those lagged from one to four days A variable selection was performed to remove insignificant terms which could still confound the pollution effects we are trying to measure Section 52 in Chapter 5 has more detail on variable selection Five pollutants were introduced both singly and in combination Since we do not know which lags of pollutant are most relevant, we used all possible lags between 0 and 4 days, as well as averages of 2, 3, 4 and 5 consecutive days within this period Among 21

the 15 possible exposure measures thus generated, the one with the most significant effect was selected and used in the subsequent analysis Possible data snooping criticisms of this

Conclusions The most significant exposure measure for TSP is current day s value The t value (estimate divided by standard error) is 31, which is significant at the (two-sided) level of 002 The most significant exposure measure for SO 2 is current day s value, with t = 33 The most significant exposure measure for NO 2 is the 4-day lagged value, with t = 21 The most significant exposure measure for CO is the average of the 3-day and 4-day lags, with t = 28 The most significant exposure measure for O 3 is the average of the current day s value with those for lags 1 and 2 This leads to t = 29 All of these are statistically significant at the 5% level, but note the selection effect of comparing different measures of exposure and only picking out the most significant 22

When different combinations of the variables are included, the results change For instance, with TSP and SO 2 together, the t statistics are 15 (TSP) and 14 (SO 2 ) Other analyses are: TSP, O 3 together: t=28 (TSP), 29 (O 3 ) TSP, SO 2, O 3 together: t=11, 16, 29 TSP, SO 2, NO 2, CO, O 3 all included: t =12, 17, 04, 21, 27 23

We can also look for nonlinear relationships between airpollution and mortality, which are fitted by more complicated versions of regression analysis For example, Fig 110 shows some possible piecewise linear relationships between TSP and mortality, suggesting higher slopes at higher TSP levels (relevant to determination of standards) The conclusions about multiple pollutants and nonlinear relationships suggest that the truth about how air pollution affects mortality may be more complicated than a simple linear relationship based on a single pollutant would suggest 24

(a) (b) 004 004 Effect 002 Effect 002 00 00-002 -002 0 20 40 60 80 100 TSP 0 20 40 60 80 100 TSP 25

Other types of study lead to different types of conclusion For example, Fig 111 is based on a famous prospective study in which adjusted mortality rate was plotted against median level of fine particulate matter in each of 51 cities The difficultes here include that measurement of particulate matter is very imprecise if we are only using a single value to represent a long period of study, and there are many other possible reasons why the deaths rate in these cities might differ (ecological bias problem) Pope, CA, Thun, MJ, Namboodiri, MM, Dockery, DW, Evans, JS, Speizer, FE and Heath, CW (1995), Particulate air pollution as a predictor of mortality in a prospective study of US adults Am J Respir Crit Care Med 151, 669 674 26

27

Summary of Chapter Air pollution is a major public health issue The relationship between air pollution (particulate matter, sulfur dioxide, etc) can be examined through regression relationships Simple analyses individual-day analyses of high pollution episodes, analyses based on annual summary statistics from long time series are interesting exercises but don t really help understand the full phenomenon Therefore, in recent years attention has move to the daily analyses of long time series which are much more informative, but also pose many complicated problems of interpretation There are other types of data sets (eg prospective studies) that add to the information (and the confusion) 28