2010 Joint Meeting of the Forest Inventory and Analysis (FIA) Symposium and the Southern Mensurationists

SAMPLING INTENSITY AND NORMALIZATIONS: EXPLORING COST-DRIVING FACTORS IN NATIONWIDE MAPPING OF TREE CANOPY COVER

John Tipton, Gretchen Moisen, Paul Patterson, Thomas A. Jackson, and John Coulston

ABSTRACT

There are many factors that will determine the final cost of modeling and mapping tree canopy cover nationwide. For example, applying a normalization process to Landsat data used in the models is important in standardizing reflectance values among scenes and eliminating visual seams in the final map product. However, normalization at the national scale is expensive and logistically challenging, and its importance to model fit is unknown. Cost also increases with each location sampled, yet appropriate photo sampling intensity relative to the FIA grid has yet to be explored. In addition, cost is affected by how intensively the photo plots themselves are sampled with a dot count, and the effect of reducing the number of dots on predictive models is also unknown. Using intensively sampled photo plot data in 5 pilot areas across the United States, we address these three cost factors by exploring the effect of a normalization process of Landsat TM data on model fits of tree canopy cover using Random Forests regression, the relationship between the sampling intensity of photo interpreted plots and model fit, and the relationship between the number of dots for each photo interpreted location and model fit.

INTRODUCTION

The National Land Cover Database (NLCD, http://www.mrlc.gov/) for 2011 will contain a map of tree canopy cover. Spatially explicit map-based data on percent tree canopy cover are used for forest management, estimates of timber production, determining the potential for and extent of fire danger, and other management issues across the United States. The 2001 NLCD provides map-based estimates of percent tree canopy cover along with land cover and percent impervious cover (Homer and others 2004). The NLCD is a periodic product with an update cycle of approximately five years.
However, because of funding constraints the percent tree canopy estimates were not updated for the 2006 NLCD. For the 2011 NLCD the U.S. Forest Service Forest Inventory and Analysis program (FIA) will take the lead on developing the percent tree canopy cover layer. FIA is uniquely positioned to lead the development of the 2011 NLCD percent tree canopy cover layer. First, FIA uses a probabilistic sample design that covers all lands (forest and non-forest) and can be easily intensified for geospatial modeling purposes. Second, the FIA program is beginning to make percent tree canopy cover estimates for all sample locations. This provides an opportunity to leverage data collected as part of the FIA program to develop predictive models used to produce percent tree canopy map products. To this end, a pilot study was carried out in 2010. The pilot study was designed to answer specific research questions and estimate costs for developing the 2011 NLCD percent tree canopy cover map.

Creating a tree canopy cover product that encompasses the entire country presents many questions that must be answered before prototype or production mapping can begin. Consequently, a pilot project was launched that included five study areas, one each in Georgia, Michigan, Kansas, Oregon, and Utah. Within each study area, over two thousand photo plots were photo-interpreted by an interpreter looking at a grid overlaid on an aerial photo of each plot. At each of the 105 points on the grid, the interpreter determined if the point was a tree or not, and this response was used to calculate percent tree cover.

Using data from the pilot project, several issues are addressed in this paper to support production mapping of tree canopy cover nationwide. First, the number of samples plays an important role in the quality of the model. It is important to find a balance between the quality of model fit and concerns of cost. Second, normalization of Landsat TM images is important because adjacent Landsat scenes on a map are not taken on the same day.
Because of this, when a mosaic of multiple images is constructed, there will be seams in the image where the raw reflectance values for one image are not equal to the reflectance values of the adjoining image. Normalization of one image to another using the overlap between two images will remove the visual seam in a map, but the effect of normalization on how well a model predicts percent tree canopy cover has not been explored. Third, at each sample location an estimate of percent tree canopy cover was made using a simple dot grid approach. The pilot study design used 105 dots; however, if the same information can be obtained with fewer points, we can trim costs and maintain the quality of the model. Consequently, in this paper we explore the effects of sample size, normalization, and number of dots on predictive models of tree canopy cover.

John Tipton, Colorado State University, Fort Collins, CO 80525
Gretchen G. Moisen, Research Forester, US Forest Service Rocky Mountain Research Station, Ogden, UT 84401
Paul Patterson, Statistician, US Forest Service Rocky Mountain Research Station, Ogden, UT 84401
Thomas A. Jackson, Colorado State University, Fort Collins, CO 80525
John W. Coulston, Research Forester, US Forest Service Southern Research Station, Knoxville, TN 37919

METHODS

Percent tree canopy cover data were collected for five study areas in the coterminous United States (Figure 1). The standard FIA sampling grid (1 plot per 2400 ha) was intensified fourfold to 1 plot per 600 ha using the techniques described by White et al. (1992). At each sample location a 105-point dot grid covering a 90 m by 90 m area was developed. At each of the 105 points, a photo interpreter determined if the point was a tree or not by examining high resolution digital aerial photography collected in 2009 (USDA 2009). The percent tree canopy cover for each sample location was defined as the number of points intersecting tree crowns divided by 105 and was used as the dependent variable for random forest model development.

The independent variables came from a variety of sources but were primarily Landsat 5 data and vegetation indices derived from Landsat data (e.g., normalized difference vegetation index, tasseled cap). Additionally, digital elevation models and derivatives (e.g., slope, aspect) were used as potential independent variables for random forest model development. The Landsat data were available as normalized mosaics and non-normalized mosaics. Because each study area covered multiple Landsat scenes, differences in spectral values among scenes arise because of differing collection dates and atmospheric effects.
The non-normalized data had no correction for these effects. The normalized data accounted for these effects by standardizing reflectance values from a target scene to a reference scene based on the overlap among scenes.

The specific modeling tool used was Random Forests, implemented in R using the library randomForest (Liaw and Wiener 2002). Random Forests is a machine learning process that uses decision trees for classification and regression. The algorithm computes many trees, with each tree getting a "vote," and the final model being a majority decision (categorical variables) or an average (continuous variables). For each node in those trees, a subset of explanatory variables is randomly selected and a dichotomous split in the data is made based on the largest decrease in the MSE of the data. To get the final model, the process is run for 500 trees, and the results are averaged. Each tree is constructed using a randomly selected set of the data where approximately one-third is held "out of bag" and can, therefore, be used as a validation data set and as a measure of model fit. Our measure of model fit is called pseudo R² and it represents the relative amount of variation in the data that is explained by the model. Pseudo R² is calculated as 1 - MSE/Var(y), where the pseudo R² is calculated individually for every tree in the forest, then averaged over all trees to compute the final value.

To investigate the question related to sample size, we performed an iterative sampling process where, for each iteration, plots were randomly sampled from our study site, a model was fit using the randomForest command, and the measure of model fit (pseudo R²) was recorded. Then, for the next iteration, the number in the sample was increased by 2 plot locations and so on until the number in the iterative sample equaled the total sample size for the study site. When plotting the pseudo R² values against the number of study site samples, we applied a lowess smoothing curve for each of the study site locations to get a visual indication of the asymptotic behavior.
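The fit measure and the iterative sampling loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the function names, the `fit_and_score` callable (a stand-in for fitting a Random Forests model and returning its out-of-bag pseudo R²), the `start` and `step` arguments, and the use of the population variance are all assumptions made for the example.

```python
import random


def pseudo_r2(observed, oob_predicted):
    """Pseudo R^2 = 1 - MSE / Var(y), computed from out-of-bag predictions."""
    n = len(observed)
    if n < 2 or n != len(oob_predicted):
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_y = sum(observed) / n
    # Assumption: population variance (divide by n), so that a model that
    # always predicts the mean scores exactly 0.
    var_y = sum((y - mean_y) ** 2 for y in observed) / n
    if var_y == 0:
        raise ValueError("observed values have zero variance")
    mse = sum((y - p) ** 2 for y, p in zip(observed, oob_predicted)) / n
    return 1.0 - mse / var_y


def sample_size_curve(plots, fit_and_score, start, step, seed=0):
    """Score random subsamples of increasing size; returns [(n, score), ...].

    fit_and_score is a hypothetical callable that fits a model to the
    subsample and returns its pseudo R^2.
    """
    rng = random.Random(seed)
    curve = []
    n = start
    while n <= len(plots):
        subsample = rng.sample(plots, n)  # draw n plots without replacement
        curve.append((n, fit_and_score(subsample)))
        n += step
    return curve


# Toy cover values (percent) to exercise the fit measure.
observed = [10.0, 20.0, 30.0, 40.0]
print(pseudo_r2(observed, observed))    # perfect predictions
print(pseudo_r2(observed, [25.0] * 4))  # always predicting the mean
```

Passing the model fit in as a callable keeps the sketch independent of any particular Random Forests implementation; the resulting (sample size, pseudo R²) pairs are the points to which a smoothing curve would then be applied.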
From this iterative sampling, we were able to get estimates of the variance of the fit of the model as well as to determine the asymptotic behavior of model fit relative to sample size.

The simulations described above were performed for both the data set that was normalized (corrected for differences in Landsat scenes) and the data set that was not normalized. This allowed us to also explore the asymptotic behavior of the model fit relative to normalization.

The final question had to do with the number of dots used for the photo interpretation grid. For each study site we sampled 500, 1000, and 1500 study locations and calculated the percent tree cover based on randomly sampling a number of photo dots. We started with sampling one dot, and then fit a Random Forests model and recorded the pseudo R². The process was then iterated, increasing the number of dots by one each time. In the plot of model fit versus the number of dots, we applied a lowess smoothing curve to see patterns in the simulations and to get a visual indication of the asymptotic behavior relative to number of dots.

Also, estimates of the number of person-hours needed to complete a prototype of the same size with different numbers of sample plots and numbers of dots were produced. This assumed 3 minutes for loading each sample plot picture and another 3 minutes to count all 105 dots.
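The time assumptions above translate into a simple cost formula. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes that counting time shrinks in proportion to the number of dots kept, which the text does not state explicitly.

```python
def interpretation_hours(n_plots, n_dots, n_areas=1,
                         load_min=3.0, full_count_min=3.0, full_grid=105):
    """Estimated person-hours to photo-interpret n_plots plots in each of n_areas areas."""
    if n_plots < 0 or n_areas < 0:
        raise ValueError("plot and area counts must be non-negative")
    if not 0 <= n_dots <= full_grid:
        raise ValueError("n_dots must lie between 0 and the full grid size")
    # Assumption: loading time is fixed per plot, while counting time
    # scales linearly with the share of the full dot grid that is counted.
    minutes_per_plot = load_min + full_count_min * n_dots / full_grid
    return n_areas * n_plots * minutes_per_plot / 60.0


# Hypothetical inputs: 1000 plots in one study area.
print(interpretation_hours(1000, 105))           # full 105-dot grid
print(round(interpretation_hours(1000, 40), 1))  # reduced 40-dot grid
```

Converting these hours into calendar weeks depends on how many hours per week an interpreter spends on the task, which is not specified here.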
RESULTS

From these simulations we were able to get an understanding of what sampling intensity provides the most information for the least cost. In Figure 2 we call attention to the smoothed curve of pseudo R² versus the number of sample plots for the non-normalized data in Oregon. Looking at the spread of the simulated model fits, we see that between 1000 and 2000 sample plots the variation in simulated pseudo R² drops off quickly. This is of interest because the default FIA sampling intensity grid for this study site is approximately 1500 plots. A similar pattern is seen in Figure 3, in which the variation in simulated pseudo R² drops off quickly between 1000 and 2000 sample plots for the other four study sites.

Figures 2 and 3 also show the effect of normalization on model fit. In the plot of sample size versus pseudo R² for Oregon in Figure 2, there is little difference in model fit between the normalized and non-normalized data. The four plots in Figure 3 show the same pattern in Georgia, Utah, and Michigan, but different results in Kansas. In the Kansas plot the normalized data model outperforms the non-normalized data model, but the difference is small (at 4000 sample points the difference in pseudo R² between the normalized and non-normalized models is about 0.03). These results indicated that normalization plays a very minor role in the quality of model fit, and we made the decision to consider only the non-normalized data set for the rest of the analyses.

Figures 4 and 5 show pseudo R² versus number of dots on the photo grid. In Figure 4 we see that in Oregon we are not getting more information by including more than 40 dots. This is evidenced by the inflection in the lowess smoothing curve on the plot. The same pattern is repeated in Figure 5 for the other study sites.
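The dot-count experiment rests on recomputing each plot's cover from a random subset of its interpreted dots. A minimal sketch of that step follows; the function names are hypothetical, the example plot is invented, and sampling dots without replacement is an assumption.

```python
import random


def percent_canopy_cover(dot_is_tree):
    """Percent tree canopy cover from a plot's dot calls (True = tree)."""
    dots = list(dot_is_tree)
    if not dots:
        raise ValueError("a plot needs at least one interpreted dot")
    hits = sum(1 for d in dots if d)
    return 100.0 * hits / len(dots)


def subsampled_cover(dot_is_tree, n_dots, rng):
    """Cover estimated from n_dots dots drawn at random from the full grid."""
    dots = list(dot_is_tree)
    if not 1 <= n_dots <= len(dots):
        raise ValueError("n_dots must be between 1 and the number of dots")
    # Assumption: dots are drawn without replacement from the full grid.
    return percent_canopy_cover(rng.sample(dots, n_dots))


# Hypothetical plot: 42 of 105 dots fall on tree crowns.
plot = [True] * 42 + [False] * 63
rng = random.Random(0)
print(percent_canopy_cover(plot))       # cover from the full grid
print(subsampled_cover(plot, 40, rng))  # cover from a random 40-dot subset
```

Repeating the subsampled calculation for every plot, refitting the model, and recording pseudo R² at each dot count would give the points summarized by the smoothing curves.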
By combining the recommendations of using non-normalized data and roughly 1000 sample plots per study site, we are able to estimate the number of person-hours needed to complete a study site of similar size. Figure 6 shows the person-hours needed versus the number of photo interpretation grid dots for 500, 1000, and 1500 sample plots, using our assumptions that each image takes three minutes to load and three minutes to calculate tree cover using all 105 dots. From this we can see that if we used 1000 sample plots with 40 dots we would expect one person to finish all five study areas in about 12 weeks.

DISCUSSION

By looking at the smoothed curves for the non-normalized data in Figures 2 and 3 we see the relationship between the number of sample plots and the precision of the model fit as measured by pseudo R². Between 1000 and 2000 sample plots the variation in pseudo R² decreases rapidly with the number of sample plots when compared to larger sample sizes. This suggests diminishing returns in model fit when increasing the number of sample plots beyond values in the 1000 to 2000 range, so we can get a good model relative to cost in that range, which also happens to be approximately the FIA standard sampling intensity grid for each study site.

Choosing to use only non-normalized data to fit a Random Forests model has major implications for the budget of the project. Normalization is an expensive and time consuming process, especially on a scale the size of the entire United States. Our results indicate that the Random Forests model performs equally well using either normalized or non-normalized data. From this result, we are able to make recommendations to get a higher quality product for less cost. However, the visual effects of not normalizing are still under investigation.
Because a human observer will be used to measure percent tree cover in the final product, using fewer dots will decrease the time the observer will spend on each photo, which will decrease the overall cost of the project. Since it appears that we gain little in terms of model fit when considering more than 40 dots, this suggests that we can reduce the person-hours needed for the prototype.

CONCLUSION

Because there are limited resources available, it is important to get an understanding of the behavior of the sampling protocols and model fits relative to the costs of the process. The recommendations in this paper give guidelines for the next prototype phase of the NLCD Canopy Cover project.

ACKNOWLEDGEMENTS

We would like to thank everyone who has worked with this pilot project for collecting the data, taking the time to photo interpret the images, and providing counsel on this project. We would also like to thank Dr. Jean Opsomer for his time and assistance.
LITERATURE CITED

Liaw, A.; Wiener, M. 2002. Classification and regression by randomForest. R News. 2(3): 18-22.

Breiman, L. 2001. Random forests. Machine Learning. 45: 5-32.

Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. 1984. Classification and regression trees. Pacific Grove, CA: Wadsworth.

Homer, C.; Huang, C.; Yang, L.; [and others]. 2004. Development of a 2001 national land-cover database for the United States. Photogrammetric Engineering and Remote Sensing. 70: 829-840.

U.S. Department of Agriculture. 2009. National Agriculture Imagery Program. Salt Lake City, UT: U.S. Department of Agriculture, Farm Service Agency, Aerial Photography Field Office. Information available: http://www.apfo.usda.gov/FSA/apfoapp?area=home&subject=prog&topic=nai

White, D.; Kimerling, A.J.; Overton, W.S. 1992. Cartographic and geometric components of a global sampling design for environmental monitoring. Cartography and Geographic Information Systems. 19: 5-22.

Figure 1-Location and extent of the five pilot study areas.

[Plot: "Oregon R² vs Sample Size"; x-axis: Number of Plots Sampled; legend: Normalized data, Non-Normalized data.]

Figure 2-Shows the pseudo R² values plotted against the number of plots sampled for Oregon for both the normalized and non-normalized data sets, with the solid lines representing a lowess smoothing curve.
[Plot: "Oregon R² vs Sample Size"; x-axis: Number of Dots Sampled; legend: 500 plots, 1000 plots, 1500 plots.]

Figure 4-Shows the pseudo R² values plotted against the number of dots sampled for Oregon for the 500, 1000, and 1500 sample plots, with the solid lines representing a lowess smoothing curve.
[Plots: "Georgia R² vs Number of Dots", "Kansas R² vs Number of Dots", "Michigan R² vs Number of Dots", "Utah R² vs Number of Dots"; x-axis: Number of Dots Sampled; legend: 500 plots, 1000 plots, 1500 plots.]

Figure 5-Shows the pseudo R² values plotted against the number of dots sampled for Georgia, Kansas, Michigan, and Utah for the 500, 1000, and 1500 sample plots, with the solid lines representing a lowess smoothing curve.
[Plot: "Number of Dots vs Time"; x-axis: Number of Dots; legend: 500 plots, 1000 plots, 1500 plots.]

Figure 6-Shows the amount of time to complete a prototype of similar size to the five study sites versus the number of dots used in photo interpretation for 500, 1000, and 1500 sample plots.