ST7002 Optional Regression Project. Postgraduate Diploma in Statistics. Trinity College Dublin. Sarah Mechan. FAO: Prof.

ST7002 Optional Regression Project
Postgraduate Diploma in Statistics
Trinity College Dublin
Sarah Mechan
FAO: Prof. John Haslett, School of Computer Science & Statistics

Section 1 Introduction

This report examines an agricultural study that was initiated to explore and develop soil-based nitrogen (N) tests for assessing soil N mineralisation capacity and grassland production potential; 6 soil N tests were evaluated. The objective of the study is to develop an improved framework for soil N testing that is both practical and rapid. The resulting framework will provide a basis for quantifying additional N fertiliser application advice on farms, and will account for soil N reserves in the same way that established soil P and K tests do for those elements.

The current gold-standard soil N test (a biological soil N test) has been shown to predict soil mineralisation potential accurately. However, it is expensive and time-consuming, and so is not practical for routine soil testing. In this research we assume that Net Mineralisation (mg NH4-N kg-1), as measured by the biological soil N test, accurately quantifies potential N mineralisation for the 37 soil types collected. We therefore use regression analysis to evaluate the 6 candidate soil N tests (predictor variables) for their ability to predict Net Mineralisation (response variable) for these soil types. Additional information on soil texture (i.e. % sand, silt, clay) and soil drainage (well drained vs. poorly drained) was also collected.

Section 2 Analysis

Initial analysis of the data

All 6 soil N tests and the additional soil properties were plotted to view the data distributions, and summary statistics were examined. The 6 soil N tests were then individually regressed against Net Mineralisation for the 37 soils to assess the strength of their individual relationships (analysis not shown). The important variables identified from this initial analysis were as follows.
Variables

Response: Net Mineralisation (mg NH4-N kg-1)
Predictors: ISNT (mg NH4-N kg-1) [Soil Nitrogen Test]; Sand (%) [Soil Texture Test]

Analysis 1

Minitab Version 15 Statistical Package was used to run a multiple linear regression of the response on the above predictors (n = 37).

Regression Analysis: Net Mineralisation versus ISNT, SAND

The regression equation is
Net Mineralisation (mg NH4-N kg-1) = [...] ISNT (mg NH4-N kg-1) [...] SAND (%)

Constant               [...]
ISNT (mg NH4-N kg-1)   [...]
SAND (%)               [...]

S = [...]   R-Sq = 84.9%   R-Sq(adj) = 84.0%

Analysis of Variance

Source       DF   SS   MS   F   P
Regression   [...]
Error        [...]
Total        [...]

Source                 DF   Seq SS   % explained
ISNT (mg NH4-N kg-1)   [...]
SAND (%)               [...]

The regression analysis shows that both predictors are statistically significant, with P-values < 0.05 and absolute T-values > 2. ISNT has a strong positive relationship with Net Mineralisation, while Sand has a negative relationship with it. Plotting Sand against ISNT (not shown here) showed no correlation between the two predictors, as expected. The Variance Inflation Factor (VIF) is also very close to 1 for both, which indicates that the predictors are not correlated. Sandier soils are expected to have lower levels of Soil Organic Matter (SOM), the soil constituent from which soil N is derived, and consequently less ISNT-nitrogen (ISNT-N).

The Sequential Sum of Squares (Seq SS) shows that ISNT, entered first, accounts for ~82% of the variation in Net Mineralisation, with % Sand adding only ~3%.

Fitted Line Plots were created in Minitab to illustrate the relationships of ISNT and % Sand with Net Mineralisation.

[Figure: Fitted Line Plot, Net Mineralisation (mg NH4-N/kg) versus ISNT (mg NH4-N/kg); R-Sq 81.8%, R-Sq(adj) 81.3%]
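The VIF check described above can be illustrated with a minimal sketch. With only two predictors, each predictor's VIF reduces to 1 / (1 - r^2), where r is the Pearson correlation between them, so uncorrelated predictors give a VIF of 1. The function names and the toy values below are hypothetical and are not the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor regression."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical toy values, NOT the study data.
isnt_toy = [1.0, 2.0, 3.0, 4.0]
sand_uncorrelated = [2.0, 1.0, 1.0, 2.0]  # r = 0 with isnt_toy
sand_correlated = [2.0, 1.0, 4.0, 3.0]    # r = 0.6 with isnt_toy

vif_uncorrelated = vif_two_predictors(isnt_toy, sand_uncorrelated)
vif_correlated = vif_two_predictors(isnt_toy, sand_correlated)
print(vif_uncorrelated, vif_correlated)
```

In this toy case the uncorrelated pair gives a VIF of 1, while a correlation of 0.6 already pushes it above 1.5, which is the sense in which a VIF close to 1 supports treating ISNT and % Sand as separate sources of information.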

[Figure: Fitted Line Plot, Net Mineralisation (mg NH4-N/kg) versus SAND (%); R-Sq 11.5%, R-Sq(adj) 8.9%]

Given the low S value and the high R² observed when ISNT is used as the sole predictor, a decision could be made at this point to restrict the analysis to simple linear regression. However, from a scientific perspective, and given the varying soil compositions in Ireland, the negative relationship with % Sand raised the question of whether soil drainage might influence the model.

Analysis 2

To test this, a new binary predictor, DrainageBIN, was created using Calc in Minitab to transform the categorical variable Soil Drainage type into an indicator variable holding the appropriate binary value for each observation: Freely Drained (0), Poorly Drained (1).

The resulting regression equation is
Net Mineralisation (mg NH4-N kg-1) = [...] ISNT (mg NH4-N kg-1) [...] DrainageBIN

Constant               [...]
ISNT (mg NH4-N kg-1)   [...]
DrainageBIN            [...]

S = [...]   R-Sq = 85.1%   R-Sq(adj) = 84.3%

Analysis of Variance

Source       DF   SS   MS   F   P
Regression   [...]
Error        [...]
Total        [...]

Source                 DF   Seq SS   % explained
ISNT (mg NH4-N/kg)     [...]
DrainageBIN            [...]
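The indicator coding used for DrainageBIN can be sketched as follows. This is a minimal illustration rather than the Minitab procedure itself; the function name and the example labels are hypothetical, and only the 0/1 assignment (Freely Drained = 0, Poorly Drained = 1) comes from the report.

```python
# Coding taken from the report: Freely Drained -> 0, Poorly Drained -> 1.
DRAINAGE_CODES = {"freely drained": 0, "poorly drained": 1}

def to_indicator(label):
    """Map a drainage class label to its 0/1 indicator value."""
    key = label.strip().lower()
    if key not in DRAINAGE_CODES:
        raise ValueError(f"Unknown drainage class: {label!r}")
    return DRAINAGE_CODES[key]

# Hypothetical example labels, NOT the study data.
drainage = ["Freely drained", "Poorly Drained", "poorly drained"]
drainage_bin = [to_indicator(d) for d in drainage]
print(drainage_bin)  # [0, 1, 1]
```

With this coding, the DrainageBIN coefficient is the estimated difference in Net Mineralisation between a poorly drained and a freely drained soil at the same ISNT value.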

What does the Analysis 2 output show? Both ISNT and DrainageBIN have positive coefficients: higher ISNT values, and poor drainage (DrainageBIN = 1), are associated with greater Net Mineralisation. Poorly drained soils may lose less SOM because decomposition is slower, and may therefore have greater potential to release ISNT-N and, ultimately, higher Net Mineralisation. The Sequential Sum of Squares (Seq SS) shows that ISNT, entered first, accounts for ~82% of the variation in Net Mineralisation, with DrainageBIN adding only ~3%.

Analysis 3

The next step in the analysis is to include all 3 significant predictor variables. ISNT gives information on the easily mineralised N components in the soil, % Sand gives information on the levels of SOM in these soils, and the drainage class adds information on the state of decomposition of the SOM.

Predictors included: ISNT, % Sand and the new DrainageBIN.

The regression equation is
Net Mineralisation (mg NH4-N/kg) = [...] ISNT (mg NH4-N/kg) [...] DrainageBIN [...] SAND (%)

Constant             [...]
ISNT (mg NH4-N/kg)   [...]
DrainageBIN          [...]
SAND (%)             [...]

S = [...]   R-Sq = 87.5%   R-Sq(adj) = 86.4%

Is this increase simply a result of increasing the number of variables?
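One way to approach the question of whether the higher R-Sq merely reflects the extra predictor is to look at R-Sq(adj), which penalises each additional variable. The sketch below applies the standard adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1), to the R-Sq values reported in the text with n = 37; the function name is illustrative only.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-Sq values as reported in the text (n = 37 soils).
two_predictors = adjusted_r_squared(0.849, 37, 2)    # ISNT + Sand
three_predictors = adjusted_r_squared(0.875, 37, 3)  # ISNT + DrainageBIN + Sand

print(f"{100 * two_predictors:.1f}%")    # 84.0%
print(f"{100 * three_predictors:.1f}%")  # 86.4%
```

These values agree with the reported R-Sq(adj) figures to one decimal place. Because R-Sq(adj) still rises after the penalty for the third predictor, the improvement does not appear to be purely an artefact of adding a variable, although this alone does not settle whether the extra predictor should be kept.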

During the initial analysis, 2 values appeared suspicious in the Versus Fits plot (circled in green on the plot).

[Figure: Residual Plots for Net Mineralisation (mg NH4-N/kg); panels: Normal Probability Plot, Versus Fits, Histogram, Versus Order]

Further examination of the data in Rows 3 & 4 identified these samples as histosols, which are shallow soils with poorly aerated organic material, i.e. not mineral soils like the remaining 35, so they were removed.

Rerunning the regression on ISNT and % Sand without these 2 outliers, the regression equation is
Net Mineralisation (mg NH4-N/kg) = [...] ISNT (mg NH4-N/kg) [...] SAND (%)

Constant             [...]
ISNT (mg NH4-N/kg)   [...]
SAND (%)             [...]

S = [...]   R-Sq = 71.1%   R-Sq(adj) = 69.3%

With the 2 outliers removed, % Sand is no longer statistically significant, with a P-value above 0.05.

Bringing DrainageBIN back in with ISNT:

Regression Analysis: Net Mineralisation versus ISNT, DrainageBIN

The regression equation is
Net Mineralisation (mg NH4-N/kg) = [...] ISNT (mg NH4-N/kg) [...] DrainageBIN

Constant             [...]
ISNT (mg NH4-N/kg)   [...]
DrainageBIN          [...]

S = [...]   R-Sq = 72.2%   R-Sq(adj) = 70.5%

Adding DrainageBIN to the equation along with ISNT shows that drainage is statistically significant. This makes scientific sense, as discussed earlier, so the decision is to include it, at least for now, until further laboratory studies can reinforce or disprove its appropriateness. At this point it would be tempting to include both Drainage and % Sand, but they measure effectively the same textural property, and including both would risk artificially inflating R-Sq(adj), to 72.1% (analysis not shown).

Regression equation: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 \pm \varepsilon$, where $\varepsilon = 2s$.

For illustration, for a soil not yet sampled with $X_1 = 2$ and $X_2 = 1$ (Poorly Drained):

Y = [...] (2) [...] (1) ± 2([...])
Y = [...] ± [...]

Section 3 Criticism

As the above prediction interval is wide, the model is, in its current state, unsuitable for immediate development of N fertiliser recommendations for farms, despite the satisfactory adjusted R² of 70.5%. Further exploration of drainage, % Sand and soil texture is required at the laboratory and field-plot level to confirm the findings. The analysis presented in this study uses a subset of a much larger, as yet unexplored, dataset. Adding further variables may well improve the model, and a larger dataset (currently in progress) might show a reduction in the variance.
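The illustrative prediction in Section 2 uses the rough rule of a point prediction plus or minus 2s. A minimal sketch of that calculation is given below. The intercept, coefficients, predictor values and s are hypothetical placeholders, not the fitted estimates from the report, and the function name is illustrative only. The 2s rule approximates a 95% prediction interval and ignores uncertainty in the estimated coefficients, so a formal interval would be somewhat wider.

```python
def approx_prediction_interval(intercept, coefs, x_values, s):
    """Point prediction with a rough +/- 2s interval."""
    if len(coefs) != len(x_values):
        raise ValueError("coefs and x_values must have the same length")
    y_hat = intercept + sum(b * x for b, x in zip(coefs, x_values))
    return y_hat - 2 * s, y_hat, y_hat + 2 * s

# Hypothetical placeholder values, NOT the fitted model from the report.
lower, y_hat, upper = approx_prediction_interval(
    intercept=10.0,         # alpha
    coefs=[0.5, 4.0],       # beta_1 (ISNT), beta_2 (DrainageBIN)
    x_values=[20.0, 1.0],   # X_1 = ISNT value, X_2 = 1 (Poorly Drained)
    s=3.0,                  # residual standard deviation S
)
print(lower, y_hat, upper)  # 18.0 24.0 30.0
```

The width of the interval is 4s regardless of the predictor values, which is why a large S translates directly into an interval too wide for fertiliser advice.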