Ronald C. Henry Yu-Shuo Chang Clifford H. Spiegelman NRCSE. T e c h n i c a l R e p o r t S e r i e s. NRCSE-TRS No. 071.

Size: px
Start display at page:

Download "Ronald C. Henry Yu-Shuo Chang Clifford H. Spiegelman NRCSE. T e c h n i c a l R e p o r t S e r i e s. NRCSE-TRS No. 071."

Transcription

1 Locatng Nearby Sources of Ar Polluton by Nonparametrc Regresson of Atmospherc Concentratons on Wnd Drecton Ronald C. Henry Yu-Shuo Chang Clfford H. Spegelman NRCSE T e c h n c a l R e p o r t S e r e s NRCSE-TRS No. 071 Sept 1, 001 The NRCSE was establshed n 1997 through a cooperatve agreement wth the Unted States Envronmental Protecton Agency whch provdes the Center's prmary fundng.

2 Locatng Nearby Sources of Ar Polluton by Nonparametrc Regresson of Atmospherc Concentratons on Wnd Drecton Ronald C. Henry * and Yu-Shuo Chang, Envronmental Engneerng Program, Cvl Engneerng Department, Unversty of Southern Calforna, 360 South Vermont Avenue, Los Angeles, CA Clfford H. Spegelman, Department of Statstcs, Texas A&M Unversty, College Staton, Texas. * Correspondng author. Submtted to Atmospherc Envronment July, 001. Abstract The relatonshp of the concentraton of ar pollutants to wnd drecton has been determned by nonparametrc regresson usng a Gaussan kernel. The results are smooth curves wth error bars that allow for the accurate determnaton of the wnd drecton where the concentraton peaks, and thus, the locaton of nearby sources. Equatons for ths method and assocated confdence ntervals are gven. A non-subjectve method s gven to estmate the only adjustable parameter. A test of the method was carred out usng cyclohexane data from 1997 at two stes near a heavy ndustral regon n Houston, Texas, USA. Accordng to publshed emssons nventores, 70 percent of the cyclohexane emssons are from one source. Nonparametrc regresson correctly dentfed the drecton of ths source from each ste. The locaton of the source determned by trangulaton of these drectons was less than 500 m from that gven n the nventory. Nonparametrc regresson s a powerful technque that has many potental uses n ar qualty studes and atmospherc scences. Keywords: Ar Polluton, Statstcs, Data Analyss, Wnd Drecton, Nonparametrc Regresson Introducton The problem addressed here s estmatng the wnd drecton that gves a local maxmum n the observed average concentraton of an atmospherc speces,.e., fndng the drectons of peaks n the concentratons. Ths drecton s taken as the drecton of the source, assumng that the source s not too dstant. Fndng the locaton of a nearby source s mportant n dentfcaton of the causes of local toxc hot spots and reconclaton of emsson nventores observed concentratons, to gve two examples. Sommervlle et al present a classc parametrc modelng approach to ths problem. In ths paper, an alternatve 1

3 nonparametrc approach s taken. Ths approach s related to the kernel densty countng procedure proposed by de Haan 1999 for certan ar qualty models. There s an enormous statstcal lterature on nonparametrc regresson. Härdle 1990 gves a very good ntroducton to ths lterature and the more practcal aspects of the subject. As wll be seen, nonparametrc regresson s a powerful, well-developed method wth many possble applcatons to ar qualty studes and, ndeed, atmospherc scences n general. It s very dffcult to locate even a sngle strong peak accurately from a smple scatter plot of the data versus hourly resultant wnd drecton. Ths s seen n Fgure 1, a scatter plot of hourly concentratons of cyclohexane observed durng 1997 at the Deer Park ste near the Houston shp channel n Houston, Texas, USA, an area domnated by large refneres and petrochemcal ndustres. The wnd speed and drecton were measured at the ste. The wnd drecton s the azmuth angle measured clockwse from north that the wnd s blowng from. The fgure clearly shows hgh concentratons when the wnd comes from between 0 and 50 degrees and between 300 and 350 degrees, but that s about all that can be sad. The usual method of analyss of the data n Fgure 1 s to group the data nto bns of wdth?? based on wnd drecton and calculate the average concentraton n each bn. The result can be dsplayed as a smple bar chart as n Fgure or as a polar chart. Polar charts are not used here because small peaks are forced nto an area near the orgn, makng them hard to see. In Fgure the bns are 10 degrees wde and start at 0 degrees. When the wnd speed s low, the drecton s not well determned, consequently all hours wth wnd speed less than 1 mle per hour were excluded from Fgure. It s now clear that the data possess several peaks, ncludng a large peak around 330 degrees; but a more precse estmate of the peak locaton s not possble for reasons dscussed next. Bar plots such as Fgure have major lmtatons for the problem at hand. The locaton of the peaks s hghly dependent on the choce of the bn sze?? and the locaton of the boundares of the bns. Peaks that are closer together than?? may not be resolved and the locaton of a peak maxmum cannot be estmated to better than ±??, at best. Ths s less of a problem f?? can be made as small as a degree or two. Wth bns ths small, however, almost always there are many bns n the less frequent wnd drectons that wll

4 have too few observatons. In practce, even wth hourly data for an entre year, bn sze can seldom be made less than 10 degrees. In addton to the large peaks, several small peaks are seen n Fgure at about 170, 0, 60, and 90 degrees. Whch, f any, of these are real peaks and whch are just random fluctuatons n the data? Relable error bars or confdence ntervals would help answer these questons. For large peaks, confdence ntervals would put bounds on the peak heght and provde a measure of the error n peak locaton. Based on the above dscusson, a method s needed to estmate the locaton of large peaks more precsely and relably separate peaks that are close to each other. The method should produce statstcal confdence ntervals. Fnally, any parameters needed by the method, such as bn sze n the bar chart, should be estmable by a reproducble, quanttatve algorthm, to reduce the subjectvty of the analyss. Such a method wth all these propertes exsts and s known n the statstcal lterature as nonparametrc regresson. The followng secton wll brefly ntroduce the method. Ths s followed by the applcaton of the method to cyclohexane data from two stes n Houston, Texas. Cyclohexane s chosen to test the method because 70% of the ndustral emssons n the regon are known to be from a sngle source. Thus, the ntersecton of two lnes drawn from each ste n the drecton of the largest peak should be the locaton of ths source. The success of the method can be judged by how closely the predcted poston of the source corresponds to the known locaton. Nonparametrc Regresson Kernel Estmators One obvous mprovement that overcomes some of the problems of a smple bar chart s to average over a sldng wndow of wdth?? centered at?. Let the observed average concentraton for the tme perod startng at t be C, where = 1,, n observatons. Further, let the resultant wnd drecton for the th tme perod be W, then the average concentraton n the sldng wndow centered at? s 3

5 1 n C = N K W C, 1 = 1 where Kx=1, for x?? / = x = x +?? /, and zero otherwse, and N s the number of data ponts nsde the sldng wndow. Fgure 3 shows the results of ths method appled to the same data as Fgure 1. Ths s certanly an mprovement over Fgure, but the curve n Fgure 3 s not very smooth, whch makes determnng peak locatons dffcult. The problem s that the functon K gves equal weght to all the measurements nsde the sldng wndow. A more reasonable approach would be to gve less weght to observatons near the edges, as shown next. To generalze equaton 1, t s mportant that??, also called the smoothng parameter, appear explctly n the equaton, so equaton 1 s rewrtten as n W K C = 1 C, =, n W K = 1 where Kx = 1 for -1/ = x = 1/, and zero otherwse. Note that the denomnator s smply a complcated way of wrtng N, the number of data ponts for whch??? / = W =? +?? /. In ths form the equaton can be generalzed by takng Kx to be any contnuous functon of x such that K x dx =1. 3 There are many possble choces for K, two of the most often used are: the Gaussan kernel 1/ K x = π exp 0.5x, -8 < x < 8, 4 and the Epanechnkov kernel. K x = x, -1 = x = 1. 5 Both of these kernels wll gve maxmum weght to observatons near? and less weght to observatons further away. The major dfference between the two s that the Gaussan kernel s defned over an nfnte doman and the Epanechnkov kernel s defned over a fnte range. For wnd drecton and other crcular 4

6 data the Gaussan kernel s preferred. For data lmted to a fnte range the Epanechnkov kernel has less bas at the end ponts and s preferred, and, under certan condtons, can be shown to converge to the true expected value at the optmal rate. The prmary nonparametrc regresson estmator for ths paper s gven by equaton wth the Gaussan kernel. Techncally, ths s an example of a Nadaraya - Watson estmator, whch s known to be consstent, that s, as sample sze ncreases the value of the estmate wll converge to the true value Härdle, 1990, p. 5. Fgure 4 shows the result of usng ths estmator on the Deer Park cyclohexane data. The plot s much smoother than the movng average plot n Fgure 3. The gray regon surroundng the curve n the fgure s the 95 percent confdence nterval, whch wll be dscussed later. The smoothng parameter for Fgure 4 was chosen to produce results comparable wth the 10-degree bns n Fgure. To ths end, we defne the smoothng parameter n terms of the Full Wdth at Half Maxmum FWHM, an ntutve measure of the wdth of the kernel functon. It s smply the full wdth of the peak n K measured at the pont where the curve has fallen to half ts value at the peak. For the Gaussan kernel, the FWHM and smoothng parameter are related by: FWHM =. 6 Thus, snce the FWHM s 10, the Gaussan smoothng parameter usually called the standard devaton for Fgure 4 s 10/v = In the followng, the FWHM wll be used as a more ntutve surrogate for the smoothng parameter. Choosng the smoothng parameter The most mportant decson n nonparametrc regresson s the choce of the smoothng parameter, or, equvalently, the FWHM. If the FWHM s too large the curve wll be too smooth and peaks could be lost or not resolved. If t s too small the curve wll have too many small, meanngless peaks domnated by nose or large peaks may resolve nto false multple peaks 5

7 6 There are several ways to select the best smoothng parameter. Ths paper apples the cross valdaton method Härdle, 1990, p.15. For each observed wnd drecton W j, j=1 n and assocated concentraton C j =Ct j use equaton to estmate the expected concentraton, but leavng out the jth observaton,.e. = j j j j j j W W K C W W K W C,. 7 The optmal smoothng parameter s the one that mnmzes V??, the mean squared dfference between concentraton estmated leavng out one observaton and the observed concentraton: 1, = = j j n j j W C C V. 8 For the Deer Park data n Fgure 4, the mnmum of V occurs at a FWHM of 7. Ths s close to the value of 10 used n Fgure 4 but ndcates that Fgure 4 may be somewhat over-smoothed. Confdence ntervals The confdence ntervals n Fgure 4 are calculated from formulae based on the asymptotc normal dstrbuton of the kernel estmates. The sample estmate of the varance of the asymptotc dstrbuton of, C s gven by Härdle, 1990, pp :., and, K, for a gaussan, 1 where σ π σ = = = = = = = C C W K n f W K n f dx x K c f n c s n n K K 9 Thus, f c a s the 100-a-quantle of the normal dstrbuton 1.96 for a = 0.05 the confdence lmts on, C are gven by:

8 C, ± c s. 10 a If a=0.05, then the expresson above gves a two-sded 95 percent confdence nterval. Ths s shown as the gray shaded area of Fgure 4. From these confdence ntervals, t s obvous that the small peak near 170 s real but the peaks near 60 and 90 are not. The peak near 0 s not an obvous call and requres further analyss. The dervaton of the equatons for the confdence ntervals gven above are based on assumptons about the error structure of the data that are often volated by ar qualty data, e.g. homogeneous varance and lack of seral correlaton. However, such formulae often perform well on data that does not strctly satsfy the assumptons of the dervaton. So a test of the formulae for confdence ntervals s a worthwhle exercse. The only ndependent method of estmatng confdence ntervals from data that s not smulated s the bootstrap. A bootstrap wth 1000 resamplngs was done on the nonparametrc regresson of cyclohexane wnd drecton for the two stes used n ths paper. From the resultng 1000 nonparametrc regresson curves, emprcal 95 percent confdence ntervals were calculated for each ste. These emprcal confdence ntervals were almost dentcal to those calculated by the above formulae. Indeed, the graph of the two curves looks lke one curve and s not shown for ths reason. These bootstrap results are dscussed further n the next to last secton of ths paper. Bas n nonparametrc regresson Ths type of nonparametrc regresson has some obvous drawbacks, chef among these beng bas. Snce the data s beng smoothed, the peaks usually wll not be qute as hgh or sharp as n realty. Ths bas s an nevtable result of the smoothng. Bas can be estmated by smply usng the output curve n Fgure 4 as the nput to the nonparametrc regresson. The bas estmate s the dfference between the twce-smoothed curve and the once-smoothed curve. Calculated ths way, the bas n Fgure 4 s less than 10 percent at the peaks, and much less than ths elsewhere. Because the bas s small n the examples consdered here, t wll be gnored n the rest of the paper. 7

9 Applcaton to 1997 Houston Cyclohexane Data Data Concentraton data for cyclohexane and other Volatle Organc Compounds VOCs for the year 1997 were obtaned from the U. S. Envronmental Protecton Agency s EPA Photochemcal Assessment Montorng Statons PAMS database. Hourly concentratons from an automated gas chromatograph from two stes were avalable. The two stes, Deer Park and Clnton Drve, are shown n Fgure 7 n context wth nearby 1997 emssons of VOCs taken from the EPA s AIRS database. Ths database does not have emssons of ndvdual speces, unlke the Ar Toxcs Emssons Inventory. Extracted from the ar toxcs nventory, Table 1lsts all the emssons of cyclohexane for the year 1997 n Harrs County, Texas, whch ncludes the cty of Houston and the Houston Shp Channel, an area of major petroleum refnng and petrochemcal ndustres. Table 1 shows that one company, Phllps Petroleum, s the source of almost 70 percent of the emssons n the nventory. Thus, one would expect that hgh concentratons of cyclohexane would be assocated wth wnd drectons at the stes comng from ths faclty. Lnes drawn n the drecton of the maxmum cyclohexane concentratons from the two stes should ntersect near the locaton gven for the source n the nventory. The accuracy of the nonparametrc regresson can be judged by how close the estmated poston s to the putatve poston. Results Fgure 5 and Fgure 6 are the result of nonparametrc regresson of cyclohexane on wnd drecton at the two stes usng optmal FWHM values. Much of the wnd data at the Clnton Drve ste was mssng. Thus, the wnd drecton and speed data from Deer Park were used n the analyss of both stes. The two stes are only km apart and the terran s very flat, so t s expected that the wnd data at one ste wll serve for both. Fgure 5 s for Deer Park but t s not the same as Fgure 4 for two reasons. Only data wth wnd speed greater than 6 mles per hour 9.66 km per hour were used, nstead of all data greater than 1 mph as for Fgure 4. The optmal FWHM for data wth wnd speed greater than 6 mph was calculated to be 5, sgnfcantly smaller than 10 used n Fgure 4. The data are restrcted to perods wth wnd speed greater than 6 mph because the Deer Park ste s 9.5 km from the Phllps source and the clearest mpact wll be seen f the travel tme from the source to the sampler s less than 1 hour. For the same reason, the data from 8

10 Clnton Drve were restrcted to hours where the wnd speed s greater than 5 mph 8 km per hour snce the source s 7.91 km from the ste. Usng the known locaton of the source to restrct the data s not a case of unacceptable crcular reasonng usng the source locaton to estmate the source locaton. Data for all wnd speed greater than 1 mph could have been used to estmate the locaton of the peaks and get an approxmate locaton of the source. From ths the approxmate dstance to the stes can be estmated, whch would lead to the same result wthout pror knowledge of the source locaton. Table gves the wnd drecton and the expected concentraton of the four largest peaks n Fgure 5 and Fgure 6. The nonparametrc regresson calculatons were carred out for each whole degree from 0 to 359. Thus, to get the entres n Table t was necessary to nterpolate to fnd the peak locatons wth greater precson than 1 degree. If the data are equally-spaced wth spacng h and local maxmum Cx o, then the nterpolated maxmum s at x o + ph, where: p C x C x 0 h C x h C x 0 + h 0 0 = C x 0 + h The nterpolated maxmum concentraton s gven by C x + ph 0.5 p p 1 C x0 h + 1 p C x p p + 1 C x0 + 0 h Both these formulae are from Abramowtz and Stegun Comparson to known sources The sources n Table 1 can now be compared wth the peaks n Table. A measure of the uncertanty of the peak locatons s helpful n ths comparson. Table gves very conservatve ranges for the peak locatons that are calculated by drawng a horzontal lne through the peak and readng ts ntersecton wth the upper confdence boundary. Not surprsngly, at both stes the largest peak has an azmuth consstent wth the locaton of the largest source n the nventory. For Clnton Drve, the second largest source n the nventory also les n the azmuth range of the largest peak n Fgure 6. Ths may explan why ths peak s so broad. The second largest peak for Clnton Drve s at 160, whch corresponds well wth Valero Refnng n the nventory. Valero only accounts for 4.79 percent of the emssons but t s located only 1.17 km from the montorng ste. It seems reasonable to assocate ths peak wth ths source. The remanng 9

11 two small peaks n Fgure 6 do not correspond to any source n Table 1. These could be emssons assocated wth non-ndustral sources such as roadways. At Deer Park, the locaton of the second largest peak n Fgure 5 corresponds closely wth Enchem Amercas n Table 1. However, the locaton of the source s gven an accuracy of only 11 km, so one cannot defntely assocate ths peak wth ths source. The remanng two small peaks n Fgure 5 do not correspond to any sources n Table 1. Locaton of the largest source The locaton of the Deer Park and Clnton Drve stes as gven by the AIRS database are N, W, and N, W. Usng these postons and the azmuth of the largest peaks the estmated locaton of the source s N, W. All calculatons nvolvng longtude and lattude were performed usng functons from the Matlab Mappng Toolbox. The dstance between ths and the locaton gven n Table 1 s km. The varablty n the predcted locaton was estmated by 1000 bootstrap calculatons. The peak postons were normally dstrbuted wth the standard devaton for Deer Park beng only degrees and degrees for Clnton Drve. Ths gves 99 percent confdence ntervals for the peak locaton of 37.5, and 8.89, for Deer Park and Clnton Drve, respectvely. Note that these ntervals are much smaller than those gven n Table, valdatng the statement that the ranges n the Table are very conservatve ranges. The correspondng estmated source locatons calculated for the 1000 bootstrap samples are gven n Fgure 7. The cloud of ponts does not qute cover the locaton of the source gven n Table 1. However, the fgure also shows that the locaton n Table 1 does not correspond to any of the VOC sources n the AIRS nventory, so t may not be entrely relable. Interestngly, the cloud of ponts les neatly between the locaton n Table 1 and the sources of Phllps Petroleum that le just to the south. No VOC emssons are hdden under the cloud and the nearby VOC sources to the east are not part of Phllps Petroleum. All thngs consdered, the locaton predcted by the nonparametrc regresson s n excellent agreement wth the nventores. 10

12 Dscusson Now that ths technque has been valdated, t may be appled to other chemcal speces n the ar that are not domnated by a sngle source. The results wll help reconcle the emssons nventores to observed concentratons. In obvous extenson of ths method, the wnd drecton analyss presented here can be appled to the source contrbutons estmated from receptor models. The results should be sharper that usng chemcal speces that usually have many sources. Prevous work on comparson of the emssons nventory for the Houston shp channel reled on smple bar charts to determne the drecton of sources Henry et al Later work of ths type should beneft from the methods presented here. Nonparametrc regresson technques could also be used n other ar qualty applcatons, such as trend analyss of tme seres or smply provdng data smoothng for exploratory analyss of ar qualty data. Ths work has demonstrated the usefulness of nonparametrc regresson of ar qualty data on wnd drecton. Nonparametrc regresson allows for accurate determnaton of the wnd drecton of maxmum concentraton. However, ground level concentratons are functon of wnd speed as well as wnd drecton. For elevated sources, ground level concentraton can be a complex functon of wnd speed. Indeed, scenaros could be constructed where the drecton of the maxmum average concentraton does not correspond to the drecton of the source. Such stuatons are probably rare, but a method that ncluded the effects of wnd speed would be desrable. Smultaneous nonparametrc regresson of concentratons on wnd drecton and wnd speed s possble and could help throw lght on, among other thngs, the dstance to the sources and whether the sources are ground level or elevated. Ths wll be the subject of a sequel to ths paper. More can be done wth the confdence ntervals to determne f a peak s real or nose and to estmate the varance of the peak locaton. Ths too wll be the subject of a later paper. 11

13 References Abramowetz, M. and I. Stegun 197. Handbook of Mathematcal Functons, Dover Publcatons, New York p. 879 and p de Haan P On the use of densty kernels for concentraton estmatons wthn partculate and puff dsperson models. Atmospherc Envronment, 33: Härdle W Appled Nonparametrc Regresson, Cambrdge Unversty Press, Cambrdge. Henry R.C., C. H. Spegelman, J. F. Collns, and EunSug Park Reported emssons of organc gases are not consstent wth observatons. Proc. Natl. Acad. Sc. USA, 94: Sommervlle, M. C., S. Murkerjee, and D. L. Fox Estmatng the wnd drecton of maxmum ar pollutant concentratons. Envronmentrcs, 7:

14 Table 1 Emssons of Cyclohexane for 1997 n Harrs Co., TX. Faclty Name Release Type Total Release lbs/year Percent of Total Lattude Longtude Accuracy m Deer Park Azmuth Dstance Km Clnton Drve Azmuth Dstance km Phllps Petroleum Co. STACK Phllps Petroleum Co. FUGITIVE Exxonmobl Baytown Refnery STACK Exxonmobl Baytown Refnery FUGITIVE Enchem Amercas Inc. STACK Lyondell-Ctgo Refnery FUGITIVE Lyondell-Ctgo Refnery STACK Shell Chemcal STACK Valero Refnng Co. STACK Valero Refnng Co. FUGITIVE Westhollow Tech. Center STACK Mllennum Petrochemcal Inc. FUGITIVE Crown Central Refnery FUGITIVE Fmc Corp. FUGITIVE Total Emssons 8488 Table Largest Peaks n the nonparametrc regresson of cyclohexane on wnd drecton, Fgure 5 and Fgure 6. Deer Park Clnton Drve Maxmum Azmuth Azmuth Range Maxmum Azmuth Azmuth Range Peak Peak Peak Peak

15 Fgure 1. Hourly cyclohexane measured at Deer Park durng 1997 versus the azmuth of the wnd drecton. 14

16 Fgure. Bar chart of average cyclohexane n 10 degree bns startng at zero. 15

17 Fgure 3. Movng average cyclohexane concentratons calculated usng a 10-degree wde sldng wndow. 16

18 Fgure 4. Nonparametrc regresson of cyclohexane versus wnd drecton usng a Gaussan kernel wth a 10 degree FWHM. Data wth wnd speed less than 1 mle per hour are excluded. The gray regon s the 95 percent confdence nterval. 17

19 Fgure 5. Nonparametrc regresson of cyclohexane at Deer Park usng a Gaussan kernel wth a FWHM of 5. Data are restrcted to perods wth wnd speed greater than 6 mles per hour about 1 hour travel tme from the largest source to the ste. The gray regon s the 95 percent confdence nterval. 18

20 Fgure 6. Nonparametrc regresson of cyclohexane at Clnton Drve usng a Gaussan kernel wth a FWHM of 10. Data are restrcted to perods wth wnd speed greater than 5 mles per hour about 1 hour travel tme from the largest source to the ste. The gray regon s the 95 percent confdence nterval. 19

21 Fgure 7. Map of 1000 bootstrap locatons of the cyclohexane source black dots. The locaton of the Phllps Petroleum source n the nventory s shown as a +. The Deer Park ste s the x and the Clnton Drve ste s the *. The gray areas are water bodes. The crcles are VOC sources wth the area proportonal to the annual emsson rate. The cloud of black dots does not hde any sources. Ralroads are dash-dotted lnes. 0