Application of Non-Parametric Kernel Regression and Nearest-Neighbor Regression for Generalizing Sample Tree Information

Annika Kangas¹ and Kari T. Korhonen²

¹ Research Scientist, Finnish Forest Research Institute, Kannus Research Station, P.O. Box 44, FIN Kannus, Finland
² Research Scientist, Finnish Forest Research Institute, Joensuu Research Station, P.O. Box 68, FIN Joensuu, Finland

Abstract - Usually in a forest survey, a large part of the measured trees are tally trees, for which only elementary characteristics are measured. Part of the trees are sample trees, which are measured more thoroughly. To be able to utilize the tally trees efficiently in the calculations, the information available for the sample trees has to be generalized to the tally trees. The generalized information should be unbiased for areas of arbitrary size (or for arbitrary groups of data). Also, the variation between sample plots and within sample plots should be realistic if these data are used as a basis for simulations of forest development. These requirements may often be contradictory. In this paper, applications of non-parametric kernel regression and nearest-neighbor regression for generalizing sample tree information are discussed. These methods may provide a satisfactory compromise between these requirements.

INTRODUCTION

Two- or multi-phase sampling is applied in most forest inventory systems. The first phase sample (tally trees) consists of a large number of trees for which diameter and other easily measurable characteristics are measured. The second phase sample (sample trees) consists of trees measured more thoroughly. Height, age and additional diameters are the most typical characteristics measured for the sample trees. A third phase sample may be collected to derive volumes or biomass from sample tree characteristics (Cunia 1986). If two-phase sampling is applied, the sample tree information has to be generalized for the tally trees.

This means that for every tally tree an expected value of each sample tree characteristic with respect to the measured characteristics is given. Most methods used are based on regression techniques (Korhonen 1993, Korhonen 1992, Cunia 1986, Kilkki 1979). The advantage of using regression models is that unbiased estimates for the population parameters are easily obtained.

An unbiased estimate of the mean does not suffice, however, if the predicted values are used as a data base for simulating the future development of forests (Ranneby and Svensson 1990). It is especially important to retain the natural variation if the models used for predicting future development are non-linear (Moeur and Stage 1995). In this case, the different tree characteristics should harmonize, the between-plot and within-plot variation of the predictions should be as realistic as possible, the treewise and standwise results should be unbiased and, in addition, the mean in the population should be unbiased. These requirements are often contradictory. It is difficult to formulate the model so that it is precise enough, provides a realistic image of forests and gives unbiased estimates for arbitrary groups. A method that is optimal in one respect may be unacceptable in other respects. What is acceptable depends on the situation.

One example of a problem of this kind is describing the effect of geographic location on stem form. Korhonen (1993) demonstrated that, in Finland, the stem form of Scots pine (Pinus sylvestris) depends on geographic location in addition to tree and stand characteristics. The traditional solution for this problem is to estimate separate models for the areas of interest. In this approach, however, the effect of location cannot be taken into account for areas smaller or larger than the pre-defined areas, and the effect of location is not continuous. Another solution is to include the location as a regressor in the applied model. A quadratic trend surface, for example, can be used for describing the large-scale variation (Korhonen 1993); a minimal sketch of this idea is given at the end of this section. With the kriging method, the small-scale correlations can also be taken into account (Ripley 1981). Unfortunately, the kriging method may be impractical for large data sets.

Often the models for generalizing the different sample tree characteristics are estimated separately. Thus, it is possible to obtain values for volume and volume increment which result in an unrealistic estimate of the volume increment percentage. Standard estimation procedures minimize componentwise losses, but what is needed is an ensemble of parameters (Louis 1984). A solution for this situation may be a system of simultaneous equations.

Non-parametric models can offer flexible solutions for generalizing sample tree data in different kinds of situations. With non-parametric methods the effect of location can be taken into account with a simple procedure. It is also easy to generalize all the sample tree characteristics at the same time. Thus, it is easier to retain the covariance structure between the different tree characteristics. In this paper, applications of semiparametric kernel regression and nearest-neighbor regression for generalizing sample tree information are discussed.
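To make the trend-surface regressor mentioned above concrete, the following minimal Python sketch (not from the paper; the data are synthetic and all variable names are illustrative) fits tree height on diameter together with a quadratic surface in the plot coordinates:

```python
import numpy as np

# Synthetic stand-in data: one row per sample tree (all values illustrative).
rng = np.random.default_rng(0)
n = 200
dbh = rng.uniform(8.0, 40.0, n)        # diameter at breast height, cm
x = rng.uniform(0.0, 100.0, n)         # easting of the plot, km
y = rng.uniform(0.0, 100.0, n)         # northing of the plot, km
height = 1.3 + 0.55 * dbh - 0.02 * x + rng.normal(0.0, 1.5, n)  # tree height, m

# Design matrix: a tree-level predictor plus a quadratic trend surface in the
# coordinates (x, y, x^2, y^2, xy) describing the large-scale geographic trend.
X = np.column_stack([np.ones(n), dbh, x, y, x**2, y**2, x * y])

# Ordinary least squares fit; the location effect is now continuous over the
# whole area instead of piecewise constant over pre-defined sub-areas.
beta, *_ = np.linalg.lstsq(X, height, rcond=None)
predicted_height = X @ beta
```

The fitted surface makes the location effect continuous over the whole area, which is the advantage the paper contrasts with estimating separate models for pre-defined sub-areas.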

The goal of this paper is to compare the presented non-parametric methods from a theoretical point of view and also with an empirical example.

METHODS

Kernel regression approach

A non-parametric estimate of a variable y at a point i is a weighted average of the measured values of y. The weight of a sample point depends on the differences in the values of the independent variables between the point of interest and the sample points. In this study a non-parametric regression model (Nadaraya 1964) was used,

\hat{y}_i = \sum_j K\left(\frac{x_j - x_i}{h}\right) y_j \Big/ \sum_j K\left(\frac{x_j - x_i}{h}\right)    (1)

where y is the dependent variable, x_j (x_i) is a vector containing the values of the independent variables at point j (i), h is the window parameter and K is the kernel function. In this study the kernel function used was a multivariate normal density function (Silverman 1986),

K(u) = (2\pi)^{-d/2} \exp\left(-\tfrac{1}{2} u^{\top} u\right)    (2)

where d is the dimension of the distribution (the number of elements in vector x_j in Eq. 1).

The use of non-parametric kernel regression with several independent variables is, however, difficult. When the number of independent variables increases, the data set may be surprisingly sparsely distributed in a high-dimensional Euclidean space. Thus, the applicable window-parameter values become too large to describe the relationships between the dependent and independent variables properly. Further, the model becomes difficult to interpret and impossible to demonstrate in graphic form (Härdle 1989). Several methods for overcoming this problem have been presented, for example, considering linear combinations of the independent variables (see Härdle 1989, Moeur and Stage 1995).

One obvious solution to the problem of several independent variables is to use semiparametric methods, i.e. a combination of parametric and non-parametric methods. In this study, the residuals of a parametric model were smoothed with non-parametric kernel regression with the coordinates as independent variables, in order to obtain localized estimates.
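The following minimal Python sketch illustrates Eqs. 1 and 2 and the semiparametric use described above, i.e. smoothing the residuals of a parametric model over the plot coordinates. The function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(u):
    """Multivariate standard normal density (Eq. 2); u has shape (n, d)."""
    d = u.shape[1]
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u**2, axis=1))

def kernel_estimate(x0, X, y, h):
    """Nadaraya-Watson estimate (Eq. 1) of y at point x0 with window width h.

    X holds the independent variables of the sample points (one row per
    point), y the corresponding measured values."""
    w = gaussian_kernel((X - x0) / h)
    return np.sum(w * y) / np.sum(w)

def semiparametric_estimate(xy0, parametric_pred, xy_sample, residuals, h):
    """Semiparametric prediction for a tally tree: the parametric prediction
    plus the kernel-smoothed residual of the parametric model, with the plot
    coordinates as the only independent variables of the smoother."""
    return parametric_pred + kernel_estimate(xy0, xy_sample, residuals, h)
```

Here parametric_pred would come from the parametric model and residuals from the sample trees; the window width h controls how local the correction is.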

Nearest-neighbor approach

A nearest-neighbor estimator is a (weighted) average of the k nearest neighbors of the point of interest. In this study a weight function (Eq. 3) with parameters a_m and k was used. The nearest neighbors are defined by some distance measure; in this study, the nearest neighbors are those with the largest weights. The parameter a_m defines the relative importance of the independent variable x_m. Another parameter of the nearest-neighbor estimator (Eq. 3) is the number of neighbors included, k.

The kernel method and the nearest-neighbor method are closely related. In the nearest-neighbor approach the number of neighbors is fixed and the size of the neighborhood may vary, whereas in non-parametric kernel regression the size of the neighborhood is fixed and the number of neighbors varies. The nearest-neighbor method is thus equivalent to a kernel method with a varying window width. Because of the varying window width, the dimensionality problem does not seem to be as serious a problem in the nearest-neighbor approach as in the kernel method.

The methods also differ in other respects. For example, the nearest-neighbor estimator produces a slightly rougher curve (Härdle 1989). This is due to the discontinuity of the estimator: as the window moves, new observations enter the set of k neighbors, and if a new neighbor differs from the current average, there is an abrupt change in the value of the nearest-neighbor estimate. Also, due to the fixed number of neighbors, the nearest-neighbor method can be expected to be more biased near the boundaries than the kernel method.

Optimal values of the parameters a_m can be sought with the cross-validation method (Altman 1992). In this method each observation is predicted from the data excluding the observation itself, and the estimator with the lowest mean of squared residuals is regarded as the best. The parameter values used in this study are from Korhonen and Kangas (1995).
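The sketch below illustrates the nearest-neighbor estimator and the cross-validation search in Python. Because the exact weight function of Eq. 3 could not be recovered from the scanned text, a weighted Euclidean distance with per-variable weights a_m is used here as a stand-in for ranking the neighbors; the names are illustrative:

```python
import numpy as np

def nn_estimate(x0, X, y, a, k):
    """Nearest-neighbor estimate of y at point x0.

    The exact weight function of Eq. 3 is not reproduced here; as a simple
    stand-in, neighbors are ranked by a weighted Euclidean distance with the
    per-variable importance weights a (the a_m of the text), and the k
    nearest values of y are averaged."""
    dist = np.sqrt(np.sum(a * (X - x0) ** 2, axis=1))
    nearest = np.argsort(dist)[:k]
    return float(np.mean(y[nearest]))

def loo_cv_mse(X, y, a, k):
    """Leave-one-out cross-validation (Altman 1992): each observation is
    predicted from the data excluding itself; the parameter values giving
    the lowest mean of squared residuals are regarded as the best."""
    n = len(y)
    residuals = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        residuals[i] = y[i] - nn_estimate(X[i], X[mask], y[mask], a, k)
    return float(np.mean(residuals**2))
```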

AN EXAMPLE

Material

The data used in this study were the pine sample trees of the 8th National Forest Inventory (NFI8) in eastern Finland. Only trees growing on site class II were included in the data. The data consist of 2063 pines measured on 375 plots. Diameter at breast height and height were measured for the sample trees used in this study. Stem volumes were calculated using the measured dimensions and the volume functions of Laasasenaho (1982). For each plot, several variables describing the site and the growing stock were registered in the NFI data. These variables include location, altitude, site class, basal area of the growing stock, dominant tree species, mean diameter and age of the growing stock, etc.

Results

The nearest-neighbor estimator was compared with the parametric and semiparametric estimators. The volumes of the sample trees of NFI8 were estimated using the other trees in the same data set. The mean and standard deviation of the residuals of the volume estimates were calculated using the parametric estimator (Eq. 5), the nearest-neighbor estimator (Eqs. 3, 4) and the semiparametric estimator (Eqs. 1, 2) with different parameter values. The within-plot and between-plot variance components of volume were estimated to test how well the different methods retain the initial variation in the data.

As the parametric estimator and as the parametric part of the semiparametric model, a regression function (Eq. 5) was used. In Eq. 5, v_ij is the volume of tree i on plot j (dm³), d_ij is the diameter at breast height (cm), ln(G_j) is the natural logarithm of the basal area of the growing stock of plot j (m²/ha), T_j is the mean age and dm_j is the mean diameter in plot j. The only independent variables in the non-parametric part of the model were the coordinates. For the nearest-neighbor approach the independent variables were d²_ij, G_j, T_j, dm_j and the coordinates. The b-parameters of the parametric model and the weights a_m of the nearest-neighbor approach are presented in Table 1.
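As an illustration of how the quantities reported in Tables 2 and 3 could be computed, the following Python sketch summarizes the residuals and decomposes the volume variation into between-plot and within-plot components. The paper does not specify its exact variance-component estimator, so a simple decomposition into the standard deviation of plot means and the standard deviation of within-plot deviations is assumed here:

```python
import numpy as np

def residual_summary(observed, predicted):
    """Mean and standard deviation (standard error) of the volume residuals."""
    r = np.asarray(observed, float) - np.asarray(predicted, float)
    return float(np.mean(r)), float(np.std(r, ddof=1))

def plot_variance_components(plot_id, volume):
    """Between-plot and within-plot standard deviations of tree volume,
    computed here simply as the standard deviation of the plot means and the
    standard deviation of the deviations from the plot means."""
    plot_id = np.asarray(plot_id)
    volume = np.asarray(volume, float)
    plot_means = {p: volume[plot_id == p].mean() for p in np.unique(plot_id)}
    between = np.array(list(plot_means.values()))
    within = volume - np.array([plot_means[p] for p in plot_id])
    return float(np.std(between, ddof=1)), float(np.std(within, ddof=1))
```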

Table 1. The variables in the parametric model and their coefficients, and the variables in the nearest-neighbor regression and their weights. (Columns: Variable, Coefficient; Variable, Weight. The values were not recoverable from the scanned original.)

The results were calculated with several different window widths h for the semiparametric approach and with different numbers of neighbors k for the nearest-neighbor approach. The results are presented in Tables 2 and 3.

Table 2. Mean and standard error of the residuals of volume and the between-plot and within-plot standard deviations of the volume predictions with different window widths for the semiparametric approach. The between-plot and within-plot standard deviations of the true volumes are [illegible] and 179.6, respectively. (Columns: Window width, Mean of residuals, Std. error, Std(plot), Std(tree); rows for the parametric estimator and the different window widths. The values were not recoverable from the scanned original.)

Table 3. Mean and standard error of the residuals of volume and the between-plot and within-plot standard deviations of the volume predictions with different numbers of neighbors for the nearest-neighbor approach. The between-plot and within-plot standard deviations of the true volumes are [illegible] and 179.6, respectively. (Columns: Number of neighbors, Mean of residuals, Std. error, Std(plot), Std(tree). The values were not recoverable from the scanned original.)

The parametric estimates of volume are unbiased in the whole area, whereas the non-parametric estimates generally are not.

The smallest standard errors are obtained with the semiparametric approach, using quite small window widths. The largest standard errors were obtained with the nearest-neighbor method. On the other hand, with the parametric regression estimator the between-plot variation decreases and the within-plot variation increases compared to the true variation. The semiparametric approach has no effect at all on the variance components. The most realistic variation is obtained with the nearest-neighbor method with only one neighbor. From these results it can be concluded that the preferable method for generalizing sample tree information depends on the situation.

DISCUSSION

In standard estimation methods, like regression analysis, the primary goal is to obtain as accurate estimates as possible for individual observations. Thus, the variance or MSE of individual predictions is minimized. In forest inventory, however, the goal is to obtain unbiased estimates of several parameters for arbitrary groups of observations. Consequently, minimizing the variance or MSE is not enough. Other criteria may be even more important than the MSE of individual observations. Also, the different requirements set for the results of the generalization method may be contradictory.

In this study, in addition to the variance and bias, the within-plot and between-plot variation of the predictions was also considered. Other components, such as subgroup biases and variances, could also have been considered. In this study the different components were not combined in order to find an optimal method, but it would be possible to define a "utility function" of the different components and their relative weights; a sketch of this idea is given at the end of this section.

With the semiparametric approach the estimates of sample tree characteristics can be localized in order to obtain better estimates for sub-areas (or sub-groups) of the data (Kangas and Korhonen 1995). This is also true for the nearest-neighbor method. In addition, with the nearest-neighbor method the initial structure of the data can be retained fairly well (see also Korhonen and Kangas 1995). This is due to the fact that all the sample tree characteristics can be generalized at the same time. With a purely non-parametric kernel regression method this would also be possible; this approach, however, requires special attention to the dimensionality problem, for example by smoothing in one dimension along a linear combination of the variables.
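A hypothetical illustration of such a utility function, written here as a loss to be minimized over the candidate methods, might look as follows. The criteria and weights are only an example of the idea mentioned above, not a method from the paper:

```python
def generalization_loss(bias, std_error, std_plot, std_tree,
                        true_std_plot, true_std_tree,
                        w_bias=1.0, w_error=1.0, w_structure=1.0):
    """Hypothetical 'utility function' in loss form: a weighted sum of the
    absolute bias, the standard error of the tree-level residuals, and the
    distortion of the between-plot and within-plot standard deviations.
    The weights express the relative importance of the criteria; lower is
    better. This is only an illustration of the idea."""
    structure = abs(std_plot - true_std_plot) + abs(std_tree - true_std_tree)
    return w_bias * abs(bias) + w_error * std_error + w_structure * structure
```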

REFERENCES

Altman, N.S. 1992. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician 46.

Cunia, T. 1986. Error of forest inventory estimates: its main components. In: Estimating Tree Biomass Regressions and Their Error. Proceedings of the Workshop on Tree Biomass Regression Functions and their Contribution to the Error of Forest Inventory Estimates, May 26-30, Syracuse, New York.

Härdle, W. 1989. Applied Nonparametric Regression. Cambridge University Press. 323 pp.

Kangas, A. and Korhonen, K.T. 1995. Generalizing sample tree information with semiparametric and parametric models. Silva Fennica 29(2).

Kilkki, P. 1979. An outline for a data processing system in forest mensuration. Silva Fennica 13(4).

Korhonen, K.T. 1992. Calibration of upper diameter models in large scale forest inventory. Silva Fennica 26(4).

Korhonen, K.T. 1993. Mixed estimation in calibration of volume functions of Scots pine. Silva Fennica 27(4).

Korhonen, K.T. and Kangas, A. 1995. Application of nearest-neighbor regression for generalizing sample tree information. Manuscript. 16 p.

Laasasenaho, J. 1982. Taper curve and volume functions for pine, spruce and birch. Communicationes Instituti Forestalis Fenniae.

Louis, T.A. 1984. Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association 79(386).

Moeur, M. and Stage, A.R. 1995. Most similar neighbor: an improved sampling inference procedure for natural resource planning. Forest Science 41(2).

Nadaraya, E.A. 1964. On estimating regression. Theory of Probability and its Applications 9.

Ranneby, B. and Svensson, S.A. 1990. From sample tree data to images of tree populations. In: Forest inventories in Europe with special reference to statistical methods. Proceedings of the International IUFRO Symposium, May 14-16, Swiss Federal Institute for Forest, Snow and Landscape Research, Birmensdorf, Switzerland.

Silverman, B.W. 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.

BIOGRAPHICAL SKETCH

Annika S. Kangas is a research scientist at the Finnish Forest Research Institute, Kannus Research Station, Finland. She holds a D.Sc. in Forestry from the University of Joensuu. Annika works on forest inventory methods, especially model-based sampling techniques.

Kari T. Korhonen is a research scientist at the Finnish Forest Research Institute, Joensuu Research Station, Finland. He holds a D.Sc. in Forestry from the University of Joensuu. Kari works on forest inventory methods, especially generalizing sample tree information.