Checking data for outliers: Few data points, tolerance tables. 7th Seminar on Statistics in Seed Testing. Gregoire, Laffont, Remund


1 Checking data for outliers: Few data points, tolerance tables. 7th Seminar on Statistics in Seed Testing. Gregoire, Laffont, Remund

2 Overview
Check a large number of data
  Routine high throughput
  Methodological studies
  Quality check on a production line
  Yearly a posteriori check of apparatus/analysts
Check data for a single result
  To secure the result
  To assess mean estimate and uncertainty
ISTA Statistics Committee

3 General ideas
Min, max, expected values (big mistake, typing error, ...)
Compute variability (and compare or use)
Represent the data points graphically
Perform appropriate statistical tests
Compare to reference tables
Double analysis, make repeats, ... (check effects)
Check the work, complement or re-test if necessary
Delete erroneous data point(s)

4 x-bar and s charts for germination test
Calculate the average germination over four 100-seed reps for each check sample
Plot these means over time
The plot centerline is the grand average (average of the averages for stable samples)
Upper and lower control limits are essentially +/- 3 standard deviations from the centerline
Similar logic is used to construct the s chart
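The centerline and limits described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the check-sample data are invented, and the limits are taken as 3 standard deviations of the plotted means, which is a simplification (classical Shewhart charts usually derive the limits from the within-sample variation and tabulated constants).

```python
from statistics import mean, stdev

def xbar_chart_limits(samples):
    """Return (LCL, centerline, UCL) for a list of check samples.

    Each sample is a list of replicate germination percentages,
    for example four 100-seed replicates.
    """
    sample_means = [mean(reps) for reps in samples]   # one plotted point per check sample
    centerline = mean(sample_means)                   # grand average
    sd_of_means = stdev(sample_means)                 # assumption: spread of the plotted means
    return centerline - 3 * sd_of_means, centerline, centerline + 3 * sd_of_means

# Hypothetical check-sample results, invented for illustration only.
samples = [
    [84, 86, 82, 84],
    [80, 83, 85, 84],
    [86, 85, 88, 85],
    [82, 84, 83, 83],
]
lcl, center, ucl = xbar_chart_limits(samples)
print(f"LCL={lcl:.2f}  centerline={center:.2f}  UCL={ucl:.2f}")
```

A new check-sample mean falling outside the interval from LCL to UCL would be flagged for investigation.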

5 x and s chart example
[Figure: X-bar chart with centerline, upper control limit (UCL) and lower control limit (LCL), and the corresponding s chart, plotted by test date from 8/2/2004 to 8/30/2004]

6 Use variability
Using previous information from a lot of comparable tests
  standard deviation (s)
  coefficient of variation (cv)
Using only the present data points
  standard deviation (s)
  coefficient of variation (cv)
s = standard deviation, cv = coefficient of variation, RMSE = root mean-square error

$$\%\,\mathrm{Bias} = \frac{\bar{x} - \mu}{\mu} \times 100$$

$$\%\,\mathrm{RMSE} = \frac{1}{\mu} \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \times 100$$

$$\%\,\mathrm{CV} = \frac{s}{\bar{x}} \times 100$$

Here $\bar{x}$ is the sample mean, $x_i$ the individual results, $n$ the number of results and $\mu$ the population (true) value.
NB: Bias and RMSE can also be used when the true value (μ) is known (proficiency test, calibration, quality control, ...)
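As a rough sketch, the three indicators can be computed as follows, assuming the usual definitions: % bias is the mean minus the true value, relative to the true value; % RMSE is the root of the mean squared deviation from the true value, relative to the true value; % CV is the sample standard deviation relative to the sample mean. The function names and the true value of 90 are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def percent_bias(values, mu):
    """Relative bias of the mean with respect to the true value mu, in %."""
    return (mean(values) - mu) / mu * 100

def percent_rmse(values, mu):
    """Root mean-square error around the true value mu, relative to mu, in %."""
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values)) / mu * 100

def percent_cv(values):
    """Coefficient of variation: sample standard deviation over the mean, in %."""
    return stdev(values) / mean(values) * 100

results = [82, 90, 89, 95]   # four replicate percentages
true_value = 90              # hypothetical known value, e.g. from a proficiency test

print(round(percent_bias(results, true_value), 2))
print(round(percent_rmse(results, true_value), 2))
print(round(percent_cv(results), 2))
```

Only `percent_cv` can be computed without a known true value, which is why it is the one available for routine data.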

7 Compare variability to limits
Compute the usual standard deviation and cv; if s or cv is much bigger (or much smaller) than usual: alarm -> action
With 2 points, this is the only possibility
With 3 points or more, simple rules or graphics can be used
  Range (max - min)
  Difference to the mean of all or some points, in %
  Box plots
  Histograms, statistical tests, ...
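The two simplest rules in this list, the range and the percentage difference of each point to the mean, might be computed like this. It is a minimal sketch with hypothetical function names; the threshold that triggers an alarm is left to the laboratory.

```python
from statistics import mean

def value_range(values):
    """Range of the data points: max minus min."""
    return max(values) - min(values)

def percent_diff_to_mean(values):
    """Difference of each point to the mean of all points, in % of that mean."""
    m = mean(values)
    return [(x - m) / m * 100 for x in values]

points = [82, 90, 89, 95]
print(value_range(points))
print([round(d, 1) for d in percent_diff_to_mean(points)])
```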

8 Data checking by tolerance tables
For a number of tests, ISTA has developed tolerance tables.
They give the maximum tolerated range of variation between different results.
If the difference is greater than the tolerance, appropriate action is needed (a re-test, for instance).

9 Example: Germination tests
Check a result
Check a result for equivalence to another
Check if a new result is below the first result
A different question: uncertainty of the result

10 Example: Germination
The maximum tolerated range among 4 replicates of 100 seeds in a germination test can be found in Table 5.1 of the ISTA Rules.
This table contains the maximum tolerated difference between the highest and lowest values of the germination percentage, at the 2.5% significance level, based on a two-sided test.

11 Example on germination, check a result
[Table: excerpt of Table 5.1 of the ISTA Rules, with columns "Average percentage germination" and "Maximum range"]
Germination test on four 100-seed sub-samples of Hordeum vulgare
  First sub-sample: 82%
  Second sub-sample: 90%
  Third sub-sample: 89%
  Fourth sub-sample: 95%
Average percentage: 89% = (82 + 90 + 89 + 95)/4
Maximum difference: 13% (95 - 82)
Tolerated difference: 12% in Table 5.1
Decision: the values are out of tolerance
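The check in the Hordeum vulgare example can be written as a short function. This is a sketch only: the function name is hypothetical, and the tolerated range is passed in by hand (12, the value the slide reads from Table 5.1 for an average of 89%) rather than looked up, since the table itself is not reproduced here.

```python
def check_replicates(replicates, tolerated_range):
    """Compare the observed range of replicate percentages with a tolerated range.

    Returns (average, observed_range, within_tolerance).
    """
    average = sum(replicates) / len(replicates)
    observed_range = max(replicates) - min(replicates)
    return average, observed_range, observed_range <= tolerated_range

# Values from the slide; 12 is the tolerated range read from Table 5.1.
average, observed_range, within_tolerance = check_replicates(
    [82, 90, 89, 95], tolerated_range=12
)
print(average, observed_range, within_tolerance)
```

With these inputs the average is 89, the observed range is 13, and the result is flagged as out of tolerance, matching the decision on the slide.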

12 Expected variability has to be assessed
For germination, variability due to random variation is used: binomial distribution (percentage)
For other species determination, variability due to random variation is used: Poisson distribution (rare events)
For purity, a list of factors has been recognized and estimated on actual situations (between bags, within bags, working sample, between analysts, within analyst)
For tetrazolium and vigour, the overall existing variability has been evaluated and taken into account
Etc.
For GM testing there are no tolerance tables (yet)
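To give an idea of the size of the purely random variation under these two models, the standard deviations can be computed directly: for a binomial percentage it is 100 * sqrt(p(1 - p)/n), and for a Poisson count it is the square root of the mean count. The sketch below uses hypothetical function names; the tolerance tables themselves are built on more than this single number.

```python
from math import sqrt

def binomial_sd_percent(p_percent, n_seeds):
    """Standard deviation, in percentage points, of a germination percentage
    observed on n_seeds seeds when the true percentage is p_percent."""
    p = p_percent / 100
    return 100 * sqrt(p * (1 - p) / n_seeds)

def poisson_sd(mean_count):
    """Standard deviation of a Poisson count (rare events) with the given mean."""
    return sqrt(mean_count)

print(round(binomial_sd_percent(89, 100), 2))  # one 100-seed replicate
print(round(binomial_sd_percent(89, 400), 2))  # the 400-seed test as a whole
print(round(poisson_sd(4), 2))                 # e.g. an average of 4 other seeds found
```

A replicate of 100 seeds at 89% germination varies by roughly 3 percentage points by chance alone, which puts a 12-point tolerated range between four replicates in context.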

13 Data checking is good practice
If we had only one batch of 400 seeds and one result, the check would not be possible.
The reasons to make repeats can be the will to check, or practical ones (boxes cannot contain all the seeds; repeats should have similar growing conditions, but we want to check it, ...)
Some labs perform tests by splitting them between two analysts, to check compatibility test by test and on a long-term basis

14 ISTA Out of tolerance table = alert
In all cases the idea is to evaluate the variability that is usually encountered in good conditions, and to derive the maximum difference between values in, for instance, 95% of the situations.
Anything above that is an alert, and appropriate action is to be taken (checks, a new test, a complementary test, ...)
NB: If the tolerance table is designed to point out 5% of the tests in usual situations, it is expected that about 5% of the tests will raise an alert in routine work
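The NB can be illustrated with a small simulation: even when nothing is wrong, a fixed share of tests exceeds the tolerated range by chance. The sketch below assumes purely binomial variation between four 100-seed replicates, a true germination of 89% and a tolerated range of 12; the function name, the number of simulated tests and the random seed are arbitrary choices.

```python
import random

def simulated_alert_rate(p, tolerated_range, n_tests=5000, reps=4,
                         seeds_per_rep=100, rng_seed=1):
    """Fraction of simulated tests whose replicate range exceeds the tolerance.

    Each replicate is a binomial count of germinated seeds out of seeds_per_rep;
    with 100 seeds per replicate the count is also the percentage.
    """
    rng = random.Random(rng_seed)
    alerts = 0
    for _ in range(n_tests):
        counts = [
            sum(rng.random() < p for _ in range(seeds_per_rep))
            for _ in range(reps)
        ]
        if max(counts) - min(counts) > tolerated_range:
            alerts += 1
    return alerts / n_tests

rate = simulated_alert_rate(p=0.89, tolerated_range=12)
print(f"Share of in-control tests raising an alert: {rate:.1%}")
```

The printed share is small but not zero, which is the point of the NB: an alert calls for a check, and a few alerts are expected in routine work even under good conditions.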

15 In case needed