Using Logistic Regression to Predict Wheat Yield in Western Zone of Haryana

Sudesh 1*, P. Verma 2 and U. Verma 1
1* PG Scholar, CCS Haryana Agricultural University, Hisar, India
2 PG Scholar (2014-16), IIT Kharagpur, India

Abstract
The effect of weather variables on wheat yield has been evaluated for the Hisar, Bhiwani, Sirsa and Fatehabad districts, which comprise the western zone of Haryana. Zonal wheat yield models based on fortnightly maximum temperature, minimum temperature, rainfall, relative humidity and sunshine hours for the period 1978-79 to 2009-10 have been developed within the framework of multiple linear regression and ordinal logistic regression. The probabilities obtained through ordinal logistic regression, along with trend predicted yield, were used to develop zonal yield models. The validity of the contending models was tested for the post-sample period 2010-11 to 2014-15. The forecasts were evaluated by percent deviations from the real-time data and by root mean square error.

Key words: Multiple linear regression, ordinal logistic regression, wheat yield forecast, percent relative deviation, root mean square error

Introduction
Timely and effective pre-harvest forecasts of crop yield are important for advance planning and for the formulation and implementation of policies related to crop procurement, distribution, price structure and import-export decisions. They are also useful to farmers in deciding their future prospects and course of action in advance. The yield of any crop is affected by technological change and weather variability. It can be assumed that technological factors increase yield smoothly through time; therefore, year or some other parameter of time can be used to study the overall effect of technology on crop yield. Weather variability, both within and between seasons, is another, uncontrollable source of variability in crop yield.
Therefore, a model based on weather and year as explanatory variables can be used effectively for forecasting crop yield. The official forecasts/advance estimates of major cereal and commercial crops are issued by the Directorate of Economics and Statistics, Ministry of Agriculture, New Delhi. However, the final estimates are given a few months after the actual harvest of the crop. Thus, the state Department of Agriculture (DOA) yield estimates are limited in both timeliness and quality of the statistics. Hence, there is considerable scope for improvement in the conventional system.

Various statistical approaches are in vogue for arriving at crop forecasts, and every approach has its own advantages and limitations. Regression analysis is one of the most widely used statistical techniques for modelling multifactor data. The inference drawn from a multiple regression model often depends on the estimates of the individual regression coefficients. However, in some situations the problem of multicollinearity arises because there are near-linear dependencies among the regressors. Moreover, regression models are often fitted to time series data, for which the assumption of uncorrelated or independent errors is often not appropriate. Thus, the use of multiple regression is limited to settings where the normal distribution is valid and a linear function relating the response to the predictors can be assumed.

Logistic regression (LR) is an increasingly popular statistical technique used to model the probability of discrete (i.e., binary or multinomial) outcomes. These models also show the extent to which changes in the values of the attributes may increase or decrease the predicted probability of an event outcome. LR is part of a category of statistical models called generalized linear models. This broad class of models includes ordinary regression and ANOVA, as well as multivariate statistics such as ANCOVA and log-linear regression. LR allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. The independent or predictor variables in LR can take any form; that is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related or of equal variance within each group. The relationship between the predictor and response variables is not a linear function in logistic regression.

Various workers have attempted to develop methodology for weather-based models of crop yield forecasting using different techniques. To mention a few: regression models by Agarwal et al. (2001), Dadhwal et al. (2005), Bazgeer et al. (2007) and Pandey et al. (2013); principal component analysis by Gervini and Rousson (2004), Wang (2012), Alkan et al. (2015) and Verma et al. (2015, 2016); and logistic regression by Greenland and Drescher (1993), Ghamdi (2002), Lin et al. (2013), Bergtold and Onukwugha (2014) and Tanaka et al. (2015).

Wheat is one of the most important cereal crops in India, as it forms a major constituent of the staple diet of a large part of the population. India is the second largest producer among the wheat growing countries of the world (Source: www.mapsofindia.com/indiaagriculture/). Haryana occupies third place for wheat production among the states of India (Source: www.agricoop.nic.in/statistics). Haryana is self-sufficient in food grain production and is one of the top contributors of food grains to the central pool. In accordance with the targeted objective, the analysis has been carried out to develop zonal wheat yield models based on weather data, using multiple linear regression and ordinal logistic regression, for district-level wheat yield assessment in the western zone of Haryana.
Study region and statistical methodology
Haryana state, comprising 21 districts, is situated between 74°25' E and 77°38' E longitude and between 27°40' N and 30°55' N latitude. The total geographical area of the state is 44212 sq. km. The Hisar, Bhiwani, Sirsa and Fatehabad districts, which comprise the western zone of the state, have been considered for the model building. The Department of Agriculture wheat yield statistics of the study region from 1978-79 to 2014-15, published by the Bureau of Economics and Statistics (BES), Haryana, were collected for the purpose. Weather data, i.e. maximum temperature, minimum temperature, rainfall, relative humidity and sunshine hours, of Hisar district for the same period were collected from the Department of Agri. Meteorology, CCS HAU, Hisar. Because weather data were not available for all the districts, the zonal model used the same (Hisar) weather information for the adjoining districts of the zone, and a year (time) variable was included to account for the variation between districts within the zone.

The weather variables affect the crop differently during different phases of its growth period. Thus, to integrate the weather variables over different growth phases, the crop growth period was divided into 11 fortnights. Wheat is basically a winter crop and is grown in the rabi season, from October-November to March-April. Data for the last month of the crop season were excluded, as the idea behind the study was to predict yield about 4-5 weeks before the crop harvest. Thus, weather data from the first fortnight of November to one month before harvest, over the period 1978-79 to 2009-10, were utilized for the model building (crop growth period: 1st November to 15th April). Five-step-ahead district-level wheat yield forecasts, i.e. for the years outside the model development period (2010-11, 2011-12, 2012-13, 2013-14 and 2014-15), have been obtained from the developed zonal models.
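The fortnightly integration of weather data described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used in the study: a fortnight is assumed to be a calendar half-month (days 1-15, and 16 to month end), the exclusion of the last month is read as keeping fortnights 1 to 9 (up to 15th March), and the names fortnight_index and fortnightly_means are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Assumption: fortnights are calendar half-months, numbered from
# 1-15 November (k = 1) to 1-15 April (k = 11).
SEASON_MONTHS = [11, 12, 1, 2, 3, 4]


def fortnight_index(d):
    """Return the crop-season fortnight (1..11) for date d, or None if outside."""
    if d.month not in SEASON_MONTHS:
        return None
    k = 2 * SEASON_MONTHS.index(d.month) + (1 if d.day <= 15 else 2)
    return k if k <= 11 else None  # 16-30 April lies outside the season


def fortnightly_means(daily, last_fortnight=9):
    """Average daily readings per fortnight for one crop season.

    daily: iterable of (date, value) pairs.
    last_fortnight=9 drops 16 March onwards, an assumed reading of
    "data for the last month of the crop season were excluded".
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for d, value in daily:
        k = fortnight_index(d)
        if k is not None and k <= last_fortnight:
            sums[k] += value
            counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}
```

Averaging suits temperature, relative humidity and sunshine hours; for rainfall a fortnightly total may be more natural, which would mean returning sums[k] instead of the mean.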
The multiple linear regression model considered may be written in the form

$$Y = Xb + \varepsilon,$$

where $Y$ is an $(n \times 1)$ vector of observations (DOA yields), $X$ is an $(n \times p)$ matrix of known form (weather variables and trend yield), $b$ is a $(p \times 1)$ vector of parameters and $\varepsilon$ is an $(n \times 1)$ vector of errors with $E(\varepsilon) = 0$ and $V(\varepsilon) = I\sigma^2$, so the elements of $\varepsilon$ are uncorrelated. Since $E(\varepsilon) = 0$, an alternative way of writing the model is $E(Y) = Xb$. The least squares technique leads to the normal equations $(X'X)b = X'Y$ (here $Y$, $X$ and $b$ are the same as above and $(X'X)$ is the dispersion matrix), which provide the solution

$$b = (X'X)^{-1} X'Y.$$

Logistic regression with ordinal response variable
If the response variable $Y$ is ordinal, the categories can be ordered in a natural way, such as good/moderate/bad. One way to take account of the ordering is to use cumulative probabilities, cumulative odds and cumulative logits. Considering $k+1$ ordered categories with probabilities $p_1, \dots, p_{k+1}$, these quantities are defined by

$$P(Y \le i) = p_1 + \dots + p_i,$$

$$\mathrm{odds}(Y \le i) = \frac{P(Y \le i)}{1 - P(Y \le i)} = \frac{p_1 + \dots + p_i}{p_{i+1} + \dots + p_{k+1}},$$

$$\mathrm{logit}(Y \le i) = \ln\left(\frac{P(Y \le i)}{1 - P(Y \le i)}\right), \quad i = 1, \dots, k.$$

The cumulative logistic model for ordinal response data is given by

$$\mathrm{logit}(Y \le i) = \alpha_i + \beta_{i1} x_1 + \dots + \beta_{ip} x_p, \quad i = 1, \dots, k.$$

Thus, we have $k$ model equations and a logistic coefficient $\beta_{ij}$ for each category/covariate combination.

Zonal wheat yield modeling with three groups
For this empirical study, crop years were categorized into three groups, viz. adverse, normal and congenial. Using the weather variables in these three groups, probabilities were obtained by ordinal logistic regression. These probabilities, along with trend predicted yield, were used for the development of zonal yield forecast models using a stepwise regression procedure. When the dependent variable is ordinal and takes three values, say 0, 1 and 2, the ordinal logistic regression model, written in terms of the cumulative logits defined above with common slope coefficients, is given as

$$\ln\left(\frac{P_0}{P_1 + P_2}\right) = \alpha_1 + \beta_1 x_1 + \dots + \beta_p x_p$$

and

$$\ln\left(\frac{P_0 + P_1}{P_2}\right) = \alpha_2 + \beta_1 x_1 + \dots + \beta_p x_p,$$

where $P_0$ is the probability of $Y = 0$, $P_1$ is the probability of $Y = 1$ and $P_2$ is the probability of $Y = 2$, the $\alpha$'s are the intercepts and the $\beta_i$'s are the regression coefficients.

Zonal yield forecast models were fitted using stepwise regression, taking the probabilities $P_1$ and $P_2$ along with trend yield as regressors. The model fitted was

$$\text{Yield} = a + b_1 P_1 + b_2 P_2 + b_3 T_r + \varepsilon,$$

where $a$ is the intercept, the $b_i$'s are the regression coefficients, $P_1$ and $P_2$ are the probabilities of $Y = 1$ and $Y = 2$, $T_r$ is the trend yield and $\varepsilon$ is the error term, distributed as $N(0, \sigma^2)$.

The contending models have been compared using

i) Percent Deviation (RD%) = {(observed yield - forecasted yield)/observed yield} × 100, which measures the deviation of the forecast yield from the observed yield, and
ii) Root Mean Square Error (RMSE), used as a measure for comparing two models, given by

$$\mathrm{RMSE} = \left[\frac{1}{n}\sum_{i=1}^{n}(O_i - E_i)^2\right]^{1/2},$$

where $O_i$ and $E_i$ are the observed and forecasted values of the crop yield and $n$ is the number of years for which forecasting has been done.

Results and Discussion
Zonal yield models were fitted by taking weather variables/estimated response probabilities along with trend yield as regressors and DOA wheat yield as regressand. The best subsets of weather variables were selected using stepwise regression (Draper and Smith, 1981), retaining the subset with the highest adjusted $R^2$ and the lowest standard error (SE) of estimate (Table 1). Further, the developed zonal models were used to obtain the district-level wheat yield estimates for the post-sample years 2010-11, 2011-12, 2012-13, 2013-14 and 2014-15.

Table 1. Parameter estimates and adjusted R2 of zonal wheat yield models.

Model    Intercept  Regressors X1, X2, ..., X7 (coefficient)                         Adj. R2  SE
Model 1  11.24      Tr (+0.84), TMX1 (+1.49), TMX2 (-1.35), TMN4 (+0.60),            0.89     3.21
                    SSH5 (-0.90), SSH8 (-1.03), TMN8 (-0.49)
Model 2  10.75      Tr (+0.69), P1 (+0.30), P2 (-74.10)                              0.83     3.01

Model 1 - Weather parameters and trend yield as regressors
Model 2 - Estimated response probability and trend yield as regressors
X1, X2, ..., X7 stand for regressors in the model
Tr - Trend yield
TMX - Av. maximum temperature
TMN - Av. minimum temperature
SSH - Av. sunshine hours
Pi - Estimated cell probability for response category 1,2,3
SE - Standard error of estimate

Table 2. Percent deviations of fitted yield(s) from DOA yield(s) using alternative models.

District    Forecast  DOA yield  Model-1 fitted  Model-1  Model-2 fitted  Model-2
            year      (q/ha)     yield (q/ha)    RD (%)   yield (q/ha)    RD (%)
Hisar       2010-11   46.22      48.06            3.97    46.43           0.44
            2011-12   50.98      45.79           10.18    47.01           7.79
            2012-13   42.73      47.02           10.05    44.40           3.92
            2013-14   44.77      41.45            7.41    48.20           7.67
            2014-15   41.80      47.83           14.42    45.59           9.06
Bhiwani     2010-11   44.65      43.98            1.51    41.43           7.22
            2011-12   43.06      41.74            3.07    42.05           2.34
            2012-13   40.55      43.00            6.03    39.47           2.66
            2013-14   42.28      37.45           11.42    43.30           2.40
            2014-15   38.50      43.85           13.90    40.71           5.75
Sirsa       2010-11   51.30      49.82            2.89    48.58           5.30
            2011-12   53.57      47.64           11.07    49.28           8.00
            2012-13   48.42      48.96            1.12    46.78           3.39
            2013-14   53.47      43.48           18.68    50.68           5.22
            2014-15   47.06      49.94            6.12    48.18           2.37
Fatehabad   2010-11   50.81      50.53            0.56    49.45           2.68
            2011-12   54.72      48.35           11.65    50.15           8.36
            2012-13   46.81      49.66            6.10    47.64           1.77
            2013-14   53.18      44.18           16.92    51.54           3.08
            2014-15   48.58      50.64            4.24    49.03           0.93

Percent deviation (RD%) = 100 × (DOA yield - fitted yield)/DOA yield; the tabulated values are magnitudes.

Table 3. Comparative view in terms of average absolute percent deviations and RMSEs of forecast yield(s) based on both the models.

Districts   Average absolute percent deviation(s)   RMSEs
            Model-1      Model-2                    Model-1   Model-2
Hisar       9.20         5.78                       4.38      2.99
Bhiwani     7.18         4.08                       3.47      1.92
Sirsa       7.97         4.86                       5.40      2.74
Fatehabad   7.89         3.37                       5.18      2.30

The performances of the zonal yield equations were compared on the basis of adjusted $R^2$, percent deviations of forecasts from the observed yields and RMSEs. The results show a considerable improvement in wheat yield assessment using ordinal logistic regression: across the four districts, the average absolute percent deviation falls from 7.18-9.20 under Model 1 to 3.37-5.78 under Model 2, and the RMSE from 3.47-5.40 to 1.92-2.99 (Table 3). The percent deviations of the Model 2 forecasts from the DOA yields are within acceptable limits, all being below 10 percent (Table 2). This indicates the usefulness of zonal yield models for district-level wheat yield assessment in the western zone of Haryana. Moreover, the developed models provide reliable forecasts of crop yield at least one month in advance of the crop harvest, while the DOA yields are obtained quite late after the actual harvest of the crop.

References
Agarwal, R., Jain, R.C. and Mehta, S.C. (2001). Yield forecast based on weather variable and agricultural inputs on agro-climatic zone basis. Ind. J. Agric. Sci. 71(7), 487-490.
Alkan, B.B., Atakan, C. and Alkan, N. (2015). A comparison of different procedures for principal component analysis in the presence of outliers. Journal of Applied Statistics 42(22), 1716-1722.
Bergtold, J.S. and Onukwugha, E. (2014). The probabilistic reduction approach to specifying multinomial logistic regression models in health outcomes research. Journal of Applied Statistics 41(10), 2206-2221.
Boken, V.K. (2000). Forecasting spring wheat yield using time series analysis. A case study for the Canadian prairies. Agro. J. 92, 1047-1053.
Dadhwal, V.K., Sehgal, V.K., Singh, R.P. and Rajak, D.R. (2005). Wheat yield modeling using satellite remote sensing with weather data: Recent Indian experience. Mausam 54, 253-262.
Draper, N.R. and Smith, H. (1981). Applied Regression Analysis, 2nd ed. New York: John Wiley.
Gervini, D. and Rousson, V. (2004). Criteria for evaluating dimension reducing components for multivariate data. The American Statistician 58(1), 72-76.
Ghamdi, A.S. (2002). Using logistic regression to estimate the influence of accident factors on accident severity. King Saud University, Saudi Arabia 34, 729-741.
Greenland, S. and Drescher, K. (1993). Maximum likelihood estimation of the attributable fraction from logistic models. Biometrics 49, 865-872.
Lin, H., Wang, C., Liu, P. and Holtkamp, D.J. (2013). Construction of disease risk scoring systems using logistic group lasso: application to porcine reproductive and respiratory syndrome survey data. Journal of Applied Statistics 40(4), 736-746.
Pandey, K.K., Rai, V.N., Sisodia, B.V.S., Bharati, A.K. and Gairola, K.C. (2013). Pre-harvest forecast models based on weather variable and weather indices for Eastern U.P. Advance in Bioresearch 4(2), 118-122.
Tanaka, H., Obayashi, C. and Takagi, Y. (2015). On second order admissibilities in two-parameter logistic regression model. Communications in Statistics - Theory and Methods 44, 1958-1966.
Verma, U., Aneja, D.R. and Hooda, B.K. (2015). Principal component technique for pre-harvest estimation of cotton yield based on plant biometrical characters. J. of Cotton Research and Development 29(2), 339-343.
Verma, U., Piepho, H.P., Goyal, A., Ogutu, J.O. and Kalubarme, M.H. (2016). Role of Climatic Variables and Crop Condition Term for Mustard Yield Prediction in Haryana (India). International J. of Agricultural and Statistical Sciences 12(1), 45-51.
Wang, W. (2012). Bayesian principal component regression with data-driven component selection. Journal of Applied Statistics 39(6), 1177-1189.
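As a supplementary check on the two evaluation measures used in the paper (percent deviation and RMSE), the following Python sketch recomputes them for Hisar from the yields listed in Table 2. The function names are illustrative and do not come from the paper; the only inputs are the published DOA and fitted yields, and taking absolute values before averaging is an assumption based on the "average absolute percent deviation" heading of Table 3.

```python
import math

# Hisar rows of Table 2: (DOA yield, Model-1 fitted, Model-2 fitted), in q/ha,
# for the forecast years 2010-11 to 2014-15.
HISAR = [
    (46.22, 48.06, 46.43),
    (50.98, 45.79, 47.01),
    (42.73, 47.02, 44.40),
    (44.77, 41.45, 48.20),
    (41.80, 47.83, 45.59),
]


def percent_deviation(observed, forecast):
    """RD% = 100 * (observed - forecast) / observed."""
    return 100.0 * (observed - forecast) / observed


def mean_abs_percent_deviation(observed, forecast):
    """Average of |RD%| over the forecast years."""
    devs = [abs(percent_deviation(o, f)) for o, f in zip(observed, forecast)]
    return sum(devs) / len(devs)


def rmse(observed, forecast):
    """Root mean square error: sqrt of the mean squared (observed - forecast)."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, forecast)) / n)


doa = [row[0] for row in HISAR]
model1 = [row[1] for row in HISAR]
model2 = [row[2] for row in HISAR]

for name, fitted in (("Model-1", model1), ("Model-2", model2)):
    print(name,
          round(mean_abs_percent_deviation(doa, fitted), 2),
          round(rmse(doa, fitted), 2))
```

The values obtained this way should lie close to the Hisar entries of Table 3 (9.20 and 5.78 for the average absolute percent deviation, 4.38 and 2.99 for the RMSE); small differences in the last digit can arise because the table averages already-rounded deviations.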