Using R for Introductory Statistics

Similar documents
Lab 1: A review of linear models

Subscribe and Unsubscribe: Archive of past newsletters

STATISTICS PART Instructor: Dr. Samir Safi Name:

Problem Points Score USE YOUR TIME WISELY SHOW YOUR WORK TO RECEIVE PARTIAL CREDIT

NEW CURRICULUM WEF 2013 M. SC/ PG DIPLOMA IN OPERATIONAL RESEARCH

4.3 Nonparametric Tests cont...

Loblolly Example: Correlation. This note explores estimating correlation of a sample taken from a bivariate normal distribution.

Statistical Modelling for Social Scientists. Manchester University. January 20, 21 and 24, Modelling categorical variables using logit models

Statistical Modelling for Business and Management. J.E. Cairnes School of Business & Economics National University of Ireland Galway.

Who Are My Best Customers?

Oneway ANOVA Using SAS (commands= finan_anova.sas)

Discussion Solution Mollusks and Litter Decomposition

Introduction of STATA

A FIRST COURSE IN STATISTICAL PROGRAMMING WITH R BY W. JOHN BRAUN, DUNCAN J. MURDOCH

Dealing with Missing Data: Strategies for Beginners to Data Analysis

Getting Started with HLM 5. For Windows

Preface. R is a program for statistical analysis and graphical display of data.

IBM SPSS Statistics: What s New

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization

SCENARIO: We are interested in studying the relationship between the amount of corruption in a country and the quality of their economy.

2016 INFORMS International The Analytics Tool Kit: A Case Study with JMP Pro

Correlation and Simple. Linear Regression. Scenario. Defining Correlation

Stats Happening May 2016

Statistics 201 Summary of Tools and Techniques

Tutorial Regression & correlation. Presented by Jessica Raterman Shannon Hodges

EFFICACY OF ROBUST REGRESSION APPLIED TO FRACTIONAL FACTORIAL TREATMENT STRUCTURES MICHAEL MCCANTS

Table. XTMIXED Procedure in STATA with Output Systolic Blood Pressure, use "k:mydirectory,

Hearst Challenge : A Framework to Learn Magazine Sales Volume

Indian Institute of Technology Kanpur National Programme on Technology Enhanced Learning (NPTEL) Course Title Marketing Management 1

Exploring Supply Dynamics in Competitive Markets

A SIMULATION STUDY OF THE ROBUSTNESS OF THE LEAST MEDIAN OF SQUARES ESTIMATOR OF SLOPE IN A REGRESSION THROUGH THE ORIGIN MODEL

Research in Applied Econometrics Chapter 0. Organization

Center for Demography and Ecology

statistical computing in R

Learn What s New. Statistical Software

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization

JMP TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

5 CHAPTER: DATA COLLECTION AND ANALYSIS

POST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM)

I am an experienced SAS programmer but I have not used many SAS/STAT procedures

Statistics For Managers Using Ms Excel 6th Edition

Soci Statistics for Sociologists

iq-training 2014 Course Catalog

Brian Macdonald Big Data & Analytics Specialist - Oracle

Unit 6: Simple Linear Regression Lecture 2: Outliers and inference

More Multiple Regression

Model construction of earning money by taking photos

Categorical Variables, Part 2

Multiple Imputation and Multiple Regression with SAS and IBM SPSS

Attachment 1. Categorical Summary of BMP Performance Data for Solids (TSS, TDS, and Turbidity) Contained in the International Stormwater BMP Database

Department of Sociology King s University College Sociology 302b: Section 570/571 Research Methodology in Empirical Sociology Winter 2006

One of the "big beasts" of statistical computing capable of much more than basic use.

Day 1: Confidence Intervals, Center and Spread (CLT, Variability of Sample Mean) Day 2: Regression, Regression Inference, Classification

Olin Business School Master of Science in Customer Analytics (MSCA) Curriculum Academic Year. List of Courses by Semester

Sociology 7704: Regression Models for Categorical Data Instructor: Natasha Sarkisian. Preliminary Data Screening

R and R Studio Lab #1. Introductions & Basics February 2017

Metaheuristics. Approximate. Metaheuristics used for. Math programming LP, IP, NLP, DP. Heuristics

AP Statistics Scope & Sequence

Department of Economics, University of Michigan, Ann Arbor, MI

WINDOWS, MINITAB, AND INFERENCE

Bayes for Undergrads

Business Data Analytics

Estimation of haplotypes

Office hours: E62-540, by appointment Teaching Fellow: TIanzhou Duan, E62-584,

BUS105 Statistics. Tutor Marked Assignment. Total Marks: 45; Weightage: 15%

Using SPSS for Linear Regression

Data from a dataset of air pollution in US cities. Seven variables were recorded for 41 cities:

Examples of Statistical Methods at CMMI Levels 4 and 5

Professional Ethics and Organizational Productivity for Employee Retention

ICTCM 28th International Conference on Technology in Collegiate Mathematics

Development of the Project Definition Rating Index (PDRI) for Small Industrial Projects. Wesley A. Collins

The Dummy s Guide to Data Analysis Using SPSS

TNM033 Data Mining Practical Final Project Deadline: 17 of January, 2011

AN EMPIRICAL STUDY OF QUALITY OF PAINTS: A CASE STUDY OF IMPACT OF ASIAN PAINTS ON CUSTOMER SATISFACTION IN THE CITY OF JODHPUR

FEATURES AND BENEFITS

PSC 508. Jim Battista. Dummies. Univ. at Buffalo, SUNY. Jim Battista PSC 508

Choosing Statistical Software

LECTURE 17: MULTIVARIABLE REGRESSIONS I

Republic of the Philippines LEYTE NORMAL UNIVERSITY Paterno St., Tacloban City. Course Syllabus in FD nd Semester SY:

SAS/STAT 13.1 User s Guide. Introduction to Multivariate Procedures

= = Intro to Statistics for the Social Sciences. Name: Lab Session: Spring, 2015, Dr. Suzanne Delaney

Sensitivity of Technical Efficiency Estimates to Estimation Methods: An Empirical Comparison of Parametric and Non-Parametric Approaches

Using Excel s Analysis ToolPak Add In

Hardness Example: Tukey s Multiple Comparisons

Appendix A Mixed-Effects Models 1. LONGITUDINAL HIERARCHICAL LINEAR MODELS

Introduction to Multivariate Procedures (Book Excerpt)

AcaStat How To Guide. AcaStat. Software. Copyright 2016, AcaStat Software. All rights Reserved.

ITEM System Usage in the Ministry of Education in Botswana

Europ. Pharm., 5th Ed. (2005), Ch. 5.3, A completely randomised (0,4,4,4)-design - An in-vitro assay of influenza vaccines

STAT 2300: Unit 1 Learning Objectives Spring 2019

Downloaded from: Usage Guidelines

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, Last revised March 28, 2015

Big Data Executive Program

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Timing Production Runs

Create a Measurement with BenchVue & Mass Data Analysis Software. Moon Jong-Won

Statistics: Data Analysis and Presentation. Fr Clinic II

Comparison of sales forecasting models for an innovative agro-industrial product: Bass model versus logistic function

ADVANCED DATA ANALYTICS

HEA 2017 Conference: Exploring Innovations in Assessment with Statistical and Data Analytical Software Packages Kathy Maitland Peter Samuels

Transcription:

R http://www.r-project.org Using R for Introductory Statistics John Verzani CUNY/the College of Staten Island January 6, 2009 http://www.math.csi.cuny.edu/verzani/r/ams-maa-jan-09.pdf John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 1 / 28

R Computers and statistics Statistics and Computers The field of statistics has been revolutionized by computers Access to large data sets Simulation studies Robust statistics,... Computational statistics: iterative algorithms, bootstrap, MCMC,... Pedagogy could catch up, but for now introductory statistics: Focuses on summary statistics now easy to compute Focuses on normal approximations to produce inference. Focuses on linear models (simple regression, ANOVA) Some focus on simulations John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 3 / 28

R Different technology options Some technology solutions for teaching statistics As of now, no consolidation in statistics software Calculators easy, reliable, students like; data sets, doesn t grow Excel ubiquitious, familiar, great for data manipulation; not always right!, programming is tedious,... GUI driven for students: Fathom, JMP, DataDesk; easy to use, don t grow so well GUI driven, commandline: Stata, SPSS, SAS, MINITAB; widely used, programming tedious. Command line: R (S-Plus); widely used, geared toward extending language. John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 4 / 28

Why use R in an Introductory class? R Different technology options www.r-project.org Open source, multi-platform, statistical computing environment: Used by many academics, businesses worldwide A programming language (similar to S-Plus) geared toward statistical usage: pre-programmed functions for common things. Command line interface (CLI) encourages computational literacy (some GUIs exist) John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 5 / 28

R Some things to consider Using R with introductory statistics Things to consider Have clear learning goals for use: statistical understanding through examples to forced computational literacy many target populations for introductory statistics GUIs make R easy to use to get at statistical questions (http: //www.amstat.org/publications/jse/v16n1/verzani.html) but currently lack the polish of professional packages. (Rcmdr, pmg, RKWard) CLI is harder to teach, but R does not have a difficult syntax to learn. Introductory statistics is fairly finite. Never enough time, hard for all students to learn independently Success depends on students students learn at different rates John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 6 / 28

Some Examples, load data Examples Data Load a data set > x <- c(1.2, 2.1, 3.3) # type in -- tedious for students > ## built in data sets (from a package in this case) > library(mass) > data(cars93) > ## output from excel as csv > x <- read.csv("test.csv", header=false) > ## tab separated > x <- read.table("test.txt", header=true) > ## or download from a website > x <- source("http://wiener.math.csi.cuny.edu/st/r/diet.r") John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 8 / 28

Numeric summaries Examples Data Summary statistics > mean(cars93$mpg.highway) # need variable name [1] 29.08602 Tables > xtabs( ~ Origin + Type, data=cars93) Type Origin Compact Large Midsize Small Sporty Van USA 7 11 10 7 8 5 non-usa 9 0 12 14 6 4 John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 10 / 28

EDA: Histograms Examples EDA Graphics > library(lattice) > histogram(~ MPG.highway, data = Cars93) Percent of Total 30 20 10 0 20 30 40 50 MPG.highway John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 12 / 28

Density plots Examples EDA Graphics > densityplot(~ MPG.highway, data = Cars93) 0.08 Density 0.06 0.04 0.02 0.00 20 30 40 50 MPG.highway John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 14 / 28

Multivariate Multivariate graphics Examples EDA Graphics > bwplot(mpg.highway ~ Origin, data = Cars93) 50 45 MPG.highway 40 35 30 25 20 USA non USA John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 16 / 28

Examples Inference T-tests. (Ignore lack of random sampling!) > t.test(mpg.highway ~ Origin, data = Cars93) Welch Two Sample t-test data: MPG.highway by Origin t = -1.7545, df = 75.802, p-value = 0.08339 alternative hypothesis: true difference in means is not equal 95 percent confidence interval: -4.1489029 0.2627918 sample estimates: mean in group USA mean in group non-usa 28.14583 30.08889 John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 18 / 28

Permutation tests Examples Inference > library(coin) ## external package -- *lots* of them > wilcox_test(mpg.highway ~ Origin, data = Cars93) Asymptotic Wilcoxon Mann-Whitney Rank Sum Test data: MPG.highway by Origin (USA, non-usa) Z = -1.3109, p-value = 0.1899 alternative hypothesis: true mu is not equal to 0 John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 20 / 28

Scatterplots Examples Linear models > plot(mpg.highway ~ Weight, data=cars93) MPG.highway 20 30 40 50 2000 2500 3000 3500 4000 Weight John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 22 / 28

Regression Examples Linear models > res <- lm(mpg.highway ~ Weight, data = Cars93) > summary(res) Call: lm(formula = MPG.highway ~ Weight, data = Cars93) Residuals: Min 1Q Median 3Q Max -7.65007-1.83591-0.07741 1.82353 11.61722 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 51.6013654 1.7355498 29.73 <2e-16 *** Weight -0.0073271 0.0005548-13.21 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 24 / 28

Examples Simulations Simulations: for loops are one approach > res <- c(); n <- 5 ## initialize > for(i in 1:1000) { + x <- rexp(n) - 1 + res[i] <- (mean(x) - 1)/(sd(x)/sqrt(n)) + } > qqmath(~ res, distribution = function(p) qt(p, df = n-1)) res 10 0 20 30 40 50 John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 26 / 28

Materials Some thoughts Students need materials to follow, often these are written by the instructor Examples should be driven by material not computer use. Examples should tell a story. Students learn at different rates materials should keep them all busy Use large, real data sets or do problems that can t otherwise be done Do share on the internet what you create Avoid complication at the expense of simplicity (at times): Focus on easy before hard: for loops before vectorized approaches (in R apply functions, sage list comprehensions) Computer languages are hard to learn, but that part is easy to forget. Functional programming easier to get across than OOP Classes are needed by the developers more so than the users John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 27 / 28

Be aware that... Some thoughts CLI + syntax has no common metaphor (email, text message,...); Source files, editors have no common metaphor Colleagues learn slower than students Adoption within dept. hard (standarization can be hard) It can be hard for students to conceptually relate computational result with question Many students aren t motivated by cost/availability of software John Verzani (CSI) Using R for Introductory Statistics January 6, 2009 28 / 28