INTRODUCTION TO STATISTICS

Size: px
Start display at page:

Download "INTRODUCTION TO STATISTICS"

Transcription

1 INTRODUCTION TO STATISTICS Slides by Pierre Dragicevic

2 WHAT YOU WILL LEARN Statistical theory Applied statistics This lecture

3 GOALS Learn basic intuitions and terminology Perform basic statistical inference with R Focus on high-level aspects Accent on estimation rather than hypothesis testing ("the New Statistics")

4 ORGANIZATION Part I - Elementary notions Part II - Tutorial with R Part III - Assignments

5 A DEFINITION Statistics is the study of the collection, analysis, interpretation, presentation and organization of data. Dodge, Y. (2006) The Oxford Dictionary of Statistical Terms, OUP.

6 ORIGINS 1750s German Statistik analysis of data about the state Quickly adopted in England (previously called political arithmetics ) Achenwall, Gottfried: Abriß der neuesten Staatswissenschaft der vornehmsten Europäischen Reiche und Republicken. Göttingen, 1749

7 ORIGINS John Graunt, 1662 Observations on the bills of mortality

8

9

10 ORIGINS John Graunt, 1662 Observations on the bills of mortality First life tables Dispelled several myths about the plague First analysis of sex ratio First realistic estimate of the population in London

11 ORIGINS Prompted collection of more data Parallel developments in probability theory Statistics then developed into a more rigorous discipline and was applied to: Business & industry Medicine Science...

12 STATS & VISUALIZATION Statistical Charts William Playfair

13 STATS & VISUALIZATION Exploratory Data Analysis Tukey, 1977

14

15

16 Statistical Graphics AT&T Bell Labs Video, 1985

17

18

19

20 German bombings in London during WWII

21 German bombings in London during WWII

22 German bombings in London during WWII

23 STATS & VISUALIZATION Confirmatory data analysis For answering questions rigorously Example: is this new drug effective? Strong focus on automatic procedures, computation and objectivity Looking at data can impair objectivity: Cherry picking, snooping, fishing, data mining

24 German bombings in London during WWII

25 Exploratory analysis and confirmatory analysis can and should proceed side by side (Tukey; 1977). Quoted from the SAS Institute STATS & VISUALIZATION Exploratory data analysis is sometimes compared to detective work: it is the process of gathering evidence. Confirmatory data analysis is comparable to a court trial: it is the process of evaluating evidence.

26 WHAT ARE STATS? A set of tools and methods With an old tradition: Origins in demographics Anchored in mathematics & probability theory Visual representations play a role A generally strong focus on (computationally cheap) numerical calculations

27 WHAT ARE STATS? Good for: - Summarizing data for presentation - Answering questions rigorously - Making predictions - Making rational, evidence-based decisions - A long accumulated experience!

28 STATISTICAL TOOLS

29 STATISTICAL TOOLS DESCRIPTIVE STATISTICS INFERENTIAL STATISTICS

30 STATISTICAL TOOLS DESCRIPTIVE STATISTICS

31 AN EXAMPLE Selling encyclopedias Robert Steve Paul Roger Geoffrey Dan

32

33

34 Seller 1 Seller 2 Seller 3 Seller 4 Seller 5 Seller 6

35 Seller 1 Seller 2 Seller 3 Seller 4 Seller 5 Seller 6

36 CENTRAL TENDENCY From Kalid Azad

37 CENTRAL TENDENCY Mode Median day Sales in euro

38 CENTRAL TENDENCY When are the mean and the median equal? When do they differ?

39 CENTRAL TENDENCY negative skew symmetric positive skew From Shreya Sethi

40 CENTRAL TENDENCY

41 CENTRAL TENDENCY

42 CENTRAL TENDENCY From Shreya Sethi

43 CENTRAL TENDENCY What is the best measure of central tendency? Income

44 DISPERSION Standard Deviation Image from Wikipedia

45 Source unknown DEPENDENCE Correlation

46 DEPENDENCE Correlation Image from Wikipedia

47 Seller 2 DEPENDENCE Correlation r = Seller 1

48 Seller 1 Seller 2 Seller 3 Seller 4 Seller 5 Seller 6

49 Seller 1 Seller 2 Seller 3 Seller 4 Seller 5 Seller 6

50

51

52 September 2016 How much can we trust this chart?

53 LET US TRAVEL TO THE FUTURE

54 September 2016

55 October 2016

56 November 2016

57 December 2016

58 September 2016

59 October 2016

60 November 2016

61 December 2016

62 BACK TO THE PRESENT

63 September 2016

64 September 2016 How much can we trust this chart?

65 STATISTICAL TOOLS INFERENTIAL STATISTICS

66 STATISTICAL INFERENCE

67 STATISTICAL INFERENCE

68 STATISTICAL INFERENCE Terminology: Sample vs. population Mean, median, standard deviation, correlation, etc: A sample statistic A population parameter

69 STATISTICAL INFERENCE Unit of statistical analysis = the thing that I m sampling from a larger population"

70 STATISTICAL INFERENCE Unit of statistical analysis

71 STATISTICAL INFERENCE Unit of statistical analysis

72 STATISTICAL INFERENCE Unit of statistical analysis

73 STATISTICAL INFERENCE Unit of statistical analysis

74 STATISTICAL INFERENCE Unit of statistical analysis

75 SAMPLING DISTRIBUTION

76 SAMPLING DISTRIBUTION "The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. [ ] It may be considered as the distribution of the statistic for all possible samples from the same population of a given size"

77 SAMPLING DISTRIBUTION From Lisa Sullivan

78 SAMPLING DISTRIBUTION Demo

79 SAMPLING DISTRIBUTION

80 SAMPLING DISTRIBUTION Standard error 95% confidence interval

81 SAMPLING DISTRIBUTION

82 SAMPLING DISTRIBUTION Resampling techniques Bootstrapping

83 From Germain Salvato-Vallverdu SAMPLING ERROR Bootstrapping

84 From Germain Salvato-Vallverdu SAMPLING ERROR Bootstrapping

85 From Germain Salvato-Vallverdu SAMPLING ERROR Bootstrapping

86 From Germain Salvato-Vallverdu SAMPLING ERROR Bootstrapping resample with replacement

87 From Germain Salvato-Vallverdu SAMPLING ERROR Bootstrapping

88 From Germain Salvato-Vallverdu SAMPLING ERROR Bootstrapping

89 SAMPLING DISTRIBUTION Bootstrapping video

90 SAMPLING DISTRIBUTION How did people this do before computers?

91 MORE HISTORY Abraham De Moivre

92 MORE HISTORY Abraham De Moivre

93 MORE HISTORY Abraham De Moivre

94 MORE HISTORY Abraham De Moivre

95 MORE HISTORY Abraham De Moivre

96 MORE HISTORY Abraham De Moivre

97 MORE HISTORY

98 NORMAL DISTRIBUTION

99 NORMAL DISTRIBUTION Sir Francis Galton Bean Machine or Galton Board:

100 NORMAL DISTRIBUTION Central Limit Theorem Given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed

101 NORMAL DISTRIBUTION Exact Confidence Intervals t ~ 1.96 for large samples

102 CONFIDENCE INTERVALS

103 CONFIDENCE INTERVALS M margin of error = length of blue line

104 CONFIDENCE INTERVALS M 95% confidence interval

105 tinyurl.com/danceptrial2 μ Different random samples

106 CONFIDENCE INTERVALS Several interpretations «a range of plausible values for μ. Values outside the CI are relatively implausible.» (Cumming and Finch, 2005) Examples of presentation formats: 2.2m, 95% CI [1.6m, 2.8m] 2.2m +/- 0.6m from 1.6m to 2.8m

107 CONFIDENCE INTERVALS «a range of plausible values for μ. Values outside the CI are relatively implausible.» (Cumming and Finch, 2005) Pill A Pill B weight loss

108 CONFIDENCE INTERVALS «a range of plausible values for μ. Values outside the CI are relatively implausible.» (Cumming and Finch, 2005) Pill A Pill B weight loss

109 CONFIDENCE INTERVALS «a range of plausible values for μ. Values outside the CI are relatively implausible.» (Cumming and Finch, 2005) A - B A - C A - D diff. in weight loss

110 CONFIDENCE INTERVALS values close to our M are the best bet for μ, and values closer to the limits of our CI are successively less good bets. (Cumming, 2013)

111 BE VAGUE IN YOUR DESCRIPTION A - B A - C A - D diff. in weight loss the figure provides good evidence that B outperforms A, whereas C and A seem very similar, and results are largely inconclusive concerning the difference between D and A.

112 PUBLISHED EXAMPLE

113 PUBLISHED EXAMPLE

114 BACK TO OUR EXAMPLE Selling encyclopedias

115

116

117

118

119