Bivariate Data Notes


Like all investigations, a bivariate data investigation should follow the statistical enquiry cycle, or PPDAC (Problem, Plan, Data, Analysis, Conclusion). Each part of the PPDAC cycle plays an important part in the investigation. For convenience, and because of assessment restrictions, the starting point will be the first P, Problem. The rest of the investigation follows from there, ending with C, Conclusion, which should sum up the findings and respond to the problem identified at the start.

Problem

This section defines the investigative problem and leads the student to look into relationships between the variables of choice. It is possibly the most important component of the investigation, and the time spent on it can determine the overall quality of the investigation. This component provides an opportunity to show justification (Merit) and statistical insight (Excellence). Before writing it, some of the variables may need to be researched to find their precise meaning. When selecting the variables to investigate, take care to ensure you are looking for a potential causal relationship.

An example of a problem for Achieved level responses could look like this:

The purpose of this investigation is to investigate how well an athlete's BMI can be used to predict their percentage body fat. The data was supplied.

When carrying out the investigation, the context of the problem should be well established and kept at the forefront of all discussion points. Some initial research could drive the production of the investigative problem. Comparisons can be alluded to and underlying variables could be discussed. Any of these variations can make the above statement suitable for a Merit or an Excellence investigation.

An example of a problem for Merit level responses could look like this:

The purpose of this investigation is to investigate whether an athlete's BMI or their sum of skin folds is better used to predict their percentage body fat, and to see if this is different depending on the gender of the athlete. The data used in the investigation was supplied and it came from the Australian Institute of Sport.

Here there is a definite intention to compare two different control (independent) variables to see their effect on the response (dependent) variable. There is also an intention to investigate subsets within each control variable to see if this gives a different conclusion.

It is worth noting at this point that the investigative question should look at variables that could potentially have a causal effect on each other. Asking if height is a good predictor of percentage body fat makes no sense, as making an athlete taller will not cause them to have a higher (or lower) percentage body fat reading.

What might an Excellence problem look like? It would be based on research that is quoted throughout the investigation. It would look something like this:

The purpose of this investigation is to look into a claim that was found in [insert reference 1 here]. This source stated that BMI can be safely used to predict a person's percentage body fat. This report will look into whether this holds for athletes, and it will also compare this to the sum of skin folds and its ability to predict an athlete's percentage body fat. Interestingly, [insert reference 2 here] go on to say that BMI is a better predictor of percentage body fat in female subjects, so this investigation will look to see if this is also true when looking at the gender of an athlete for both BMI and the sum of skin folds. The supplied data used in this investigation came from the Australian Institute of Sport. It includes data about 102 male athletes and 100 female athletes.

Remember these are all just examples. Any wording is acceptable so long as the purpose of the investigation is clear and the variables of interest have been clearly identified.

Plan

This section is where the process of the investigation is described. What will be done and what are the expected outcomes? This needs to be kept in context, and for it to count towards a Merit or an Excellence grade, clear comparisons and research need to be linked into what is written.

An example of a Plan for an Achieved level response could look like this:

The computer software iNZight is going to be used to produce the scatter plots for two different control variables against the same response variable. The equations will also be generated. The graphs will be used to choose the most valid model for predicting the response variable. The equation for this graph will then be used to make a prediction, and a comment will be made to answer the investigative question.

Data

This section is where a description of the data is given. The extent of this description depends on whether the report is aimed at Achieved, Merit or Excellence. It is here that the data should be discussed, including the use of correct units and a demonstration of understanding of where the data has come from and what it means in terms of the context.

Analysis

A scatter plot is used to show how two variables are associated. If a population is being studied, and in particular variables X and Y are being looked at, then each dot on the scatter plot represents the values x and y for an individual member of the population. The whole plot gives a visual representation of the entire sample.

A side note here: remember that the names of variables are capital letters, and a particular value of that variable is represented using the lower case version of the same letter.

Unlike in a time series, the data points are not connected by line segments. Instead, when a pattern emerges in the placement of the data points, a line of best fit, or trend line, is added. Usually you will find the point (x̄, ȳ) on that line, where x̄ is the mean of the X variable and ȳ is the mean of the Y variable.

When analysing a scatter plot, the mnemonic TARSOG will help to focus comments on the specific features that are present:

T is for Trend: is it linear or something else?
A is for Association: is it positive or negative?
R is for Relationship: is it strong or weak?
S is for Scatter: is it constant or not? Fan shaped?
O is for Outliers: are any identifiable?
G is for Groups: are there any?

The trend line (something iNZight will produce) will be used later to make predictions of the response variable for particular values of the control variable. Choosing a linear trend line at first is an arbitrary decision. The linear option is checked first because it is the simplest model and the easiest to interpret in context with any tangible meaning. The other options in iNZight are quadratic (parabolic) and cubic. At this point the fitness of the model is judged by a visual check. Throughout the rest of this section, there are discussions that lead to evidence to support or reject the use of a linear trend line.

The association of the data values looks at whether there is a positive association (as the control variable increases, so does the response variable) or a negative association (as the control variable increases, the response variable decreases).

When iNZight gives the equation of the trend line it also produces another value it calls correlation. The correct name for this value is the correlation coefficient, and it is often assigned the letter r. The correlation coefficient can range in value from -1 (a perfect negative association) through to 1 (a perfect positive association). This number allows a description to be assigned to the strength of the relationship.
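To make the trend line and the correlation coefficient concrete, here is a minimal sketch in Python. It assumes the trend line is an ordinary least-squares line (an assumption about what the software fits; for such a line the mean point always lies on it), and the BMI and body fat values are made up for illustration, not taken from the Australian Institute of Sport data. The function names are hypothetical.

```python
from math import sqrt


def summarise(xs, ys):
    """Return the two means and the sums of squares and products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar, sxx, _, sxy = summarise(xs, ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def correlation(xs, ys):
    """Correlation coefficient r, always between -1 and 1."""
    _, _, sxx, syy, sxy = summarise(xs, ys)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative values (control variable, response variable).
bmi = [20.0, 22.0, 24.0, 26.0, 28.0]
body_fat = [8.0, 11.0, 12.0, 16.0, 18.0]

slope, intercept = fit_line(bmi, body_fat)
r = correlation(bmi, body_fat)

# The fitted line evaluated at the mean of X gives the mean of Y.
x_bar = sum(bmi) / len(bmi)
y_bar = sum(body_fat) / len(body_fat)
on_line = abs((intercept + slope * x_bar) - y_bar) < 1e-9

print(f"body_fat = {intercept:.2f} + {slope:.2f} * bmi")
print(f"r = {r:.3f}, mean point on line: {on_line}")
```

A positive slope goes with a positive r and a negative slope with a negative r, which is why the sign of r describes the direction of the association.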

As a general rule of thumb, these descriptions of the relationship and the corresponding values of r are acceptable:

Value of r          Strength
0.75 to 1           Strong
0.5 to 0.75         Moderate
0.25 to 0.5         Weak
0 to 0.25           None
-0.25 to 0          None
-0.5 to -0.25       Weak
-0.75 to -0.5       Moderate
-1 to -0.75         Strong

Outliers are a big source of variation and need to be looked into carefully. There must be a good reason to remove a value from a data set, as doing so can dramatically alter the relationship. Two or three values would be an absolute maximum to remove, and usually the removal of one outlier is sufficient to see a change. There are two distinct types of outliers: those that do not fit the pattern of the rest of the data (the left-hand graph has one circled) and those that fit the pattern but are a long way from the main data set (the right-hand graph has one of these).
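The rule-of-thumb bands, and the effect a single outlier can have on r, can be sketched as follows. The data values are made up for illustration, the function names are hypothetical, and the treatment of values that fall exactly on a boundary (for example r = 0.75 counted as strong) is an assumption, since the rule of thumb does not say which band a boundary value belongs to.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r for paired values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def describe_strength(r):
    """Map r to the rule-of-thumb description, ignoring its sign."""
    size = abs(r)
    if size >= 0.75:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.25:
        return "weak"
    return "none"


# Made-up data: five points on a straight line plus one point
# that does not fit the pattern of the rest.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 1]

r_all = correlation(xs, ys)
r_trimmed = correlation(xs[:-1], ys[:-1])  # last point removed

print(f"all points:      r = {r_all:.2f} ({describe_strength(r_all)})")
print(f"outlier removed: r = {r_trimmed:.2f} ({describe_strength(r_trimmed)})")
```

Because one point can move the description from one band to another, any removal needs to be justified in context rather than chosen to improve r.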

When trying to identify potential outliers of the first type, residuals help a lot. A residual is the vertical distance from a raw data value to the value predicted by the trend line. The residuals need to be calculated and graphed to back up the selection of type 1 outliers. In some cases a pattern will emerge when the residuals are graphed; this suggests that a linear model was perhaps not the best choice. A visual check of the linear trend line on the raw data will confirm this. The programme iNZight only offers quadratic and cubic models as curved alternatives. Other software also allows the user to look at logarithmic, power and exponential models.

Sometimes when plotting bivariate data, groupings become apparent in the data. These groupings can usually be explained by looking at a third variable. This third variable is commonly a categorical variable, so it is able to separate the data into groups.

Conclusion

Predictions form part of the conclusion, as they are used to help answer the investigative question the report started with. There are two different types of prediction that should be looked into: interpolations and extrapolations. An interpolation is a prediction made within the range of x-values present in the sample, and an extrapolation is a prediction made outside that range, above or below.
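As a rough sketch of these two ideas, the Python below calculates residuals for a given trend line and labels a prediction as an interpolation or an extrapolation. The data, the slope and the intercept are made-up illustrative values, the function names are hypothetical, and the residual is taken as observed minus predicted, which is an assumed sign convention.

```python
def predict(x, slope, intercept):
    """Value of the response variable given by the trend line."""
    return intercept + slope * x


def residuals(xs, ys, slope, intercept):
    """Observed value minus predicted value for every data point."""
    return [y - predict(x, slope, intercept) for x, y in zip(xs, ys)]


def prediction_type(x_new, xs):
    """Interpolation if x_new lies within the sampled x-range."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"
    return "extrapolation"


# Made-up illustrative data and an assumed trend line for it.
bmi = [20.0, 22.0, 24.0, 26.0, 28.0]
body_fat = [8.0, 11.0, 12.0, 16.0, 18.0]
slope, intercept = 1.25, -17.0

res = residuals(bmi, body_fat, slope, intercept)

# The point with the largest residual (in size) is the first one
# to examine as a possible type 1 outlier.
worst = max(range(len(res)), key=lambda i: abs(res[i]))

print("residuals:", res)
print("largest residual at bmi =", bmi[worst])

for x_new in (25.0, 35.0):
    value = predict(x_new, slope, intercept)
    print(x_new, round(value, 2), prediction_type(x_new, bmi))
```

Graphing the residuals against the control variable is the next step: residuals scattered randomly around zero support the linear model, while a curve or fan shape suggests otherwise.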

An appropriate evaluation of these predictions is required, and it leads on to the answer to the investigative question. When summing up in the conclusion, great care must be taken when making statements about causal relationships. Potential underlying variables must have been analysed carefully in order to strengthen the argument for or against such a claim. Have other variables that could potentially influence the response variable been considered, rather than just looking for a straight predictive relationship?

To be continued