Bivariate Data Notes

Like all investigations, a Bivariate Data investigation should follow the statistical enquiry cycle, or PPDAC. Each part of the PPDAC cycle plays an important part in the investigation and, for the sake of convenience and assessment restrictions, the starting point will be the first P, Problem. From here the rest of the investigation should follow, ending with C, Conclusion, which should sum up the findings and give a response to the Problem identified at the start.

Problem

This section will define the investigative problem and lead the student to look into relationships between variables of choice. This is possibly the most important component of the investigation. Time spent on this component can determine the overall quality of the investigation. This component provides an opportunity to show justification (M) and statistical insight (E).

Before writing this component of the investigation, some of the variables may need to be researched to find their precise meaning. When selecting the variables to investigate, careful consideration is needed to ensure you are looking for a potential causal relationship.

An example of a problem for Achieved level responses could look like this:

The purpose of this investigation is to investigate how well an athlete's BMI can be used to predict their percentage body fat. The data was supplied.

When carrying out the investigation, the context of the problem should be well established and kept to the forefront of all discussion points. Some initial research could drive the production of the investigative problem. Comparisons can be alluded to and underlying variables could be discussed. Any of these variations can make the above statement suitable for a Merit or an Excellence investigation.
An example of a problem for Merit level responses could look like this:

The purpose of this investigation is to investigate whether an athlete's BMI or their sum of skin folds is better used to predict their percentage body fat, and to see if this is different depending on the gender of the athlete. The data used in the investigation was supplied and it came from the Australian Institute of Sport.

Here there is a definite aim to compare two different control (independent) variables to see their effect on the response (dependent) variable. There is also an aim to investigate subsets within each control variable to see if this gives a different conclusion.

It is worth noting at this point that the investigative question should be looking at variables that could potentially have a causal effect on each other. Asking if height is a good predictor of percentage body fat makes no sense, as making an athlete taller will not cause them to have a higher (or lower) percentage body fat reading.

What might an Excellence problem look like? It would be based on research that will be quoted throughout the investigation. It would look something like this:

The purpose of this investigation is to look into a claim that was found in [insert reference 1 here]. This source stated that BMI can be safely used to predict a person's percentage body fat. This report will look into whether this holds for athletes and it will also compare this to the sum of skin folds and its ability to predict an athlete's percentage body fat. Interestingly, [insert reference 2 here] go on to say that BMI is a better predictor of percentage body fat in female subjects, so this investigation will look to see if this is also true when looking at the gender of an athlete for both BMI and the sum of skin folds. The supplied data used in this investigation came from the Australian Institute of Sport. It includes data about 102 male athletes and 100 female athletes.

Remember these are all just examples. Any problem is suitable so long as the purpose of the investigation is clear and the variables of interest have been clearly identified.

Plan

This section is where the process of the investigation is described. What will be done and what are the expected outcomes? This needs to be kept in
context and, for this to count towards an M or an E grade, clear comparisons and research need to be linked into what is written.

An example of a Plan for an Achieved level response could look like this:

The computer software iNZight is going to be used to produce the scatter plots for two different control variables against the same response variable. The equations will also be generated. The graphs will be used to choose the most valid model for predicting the response variable. The equation for this graph will then be used to make a prediction and a comment will be made to answer the investigative question.

Data

This section is where a description of the data is given. The extent of this description depends on whether the report is aimed at Achieved, Merit or Excellence. It is here that the data should be discussed, including the use of correct units and a demonstration of understanding of where the data has come from and what it means in terms of the context.

Analysis

A scatter plot is used to show how two variables are associated. If a population is being studied and, in particular, variables X and Y are being looked at, then each dot on the scatter plot represents the values x and y for an individual member of the population. The whole plot gives the visual representation of the entire sample.
A side note here: remember that the names of variables are capital letters and a particular value of that variable is represented using the lower case version of the same letter.

Unlike in a Time Series, the data points are not connected by line segments. Instead, when a pattern emerges in the placement of the data points, a line of best fit, or trend line, is added. Usually you will find (x̄, ȳ) on that line, where x̄ is the mean of the X variable and ȳ is the mean of the Y variable.

When analysing a scatter plot, the mnemonic TARSOG will help to focus comments about specific features that are present.

T is for Trend: is it linear or something else?
A is for Association: is it positive or negative?
R is for Relationship: is it strong or weak?
S is for Scatter: is it constant or not? Fan shaped?
O is for Outliers: are any identifiable?
G is for Groups: are there any?

The trend line (something iNZight will produce) will be used later to make predictions of the response variable for particular values of the control variable. Choosing a linear trend line initially is an arbitrary decision. The linear option is checked first as it is the simplest and the easiest to interpret in context with any tangible meaning. The other options in iNZight are quadratic (parabolic) and cubic. At this point, checking how well the model fits is a visual check. Throughout the rest of this section, there are discussions that lead to evidence to support or reject the use of a linear trend line.

The association of the data values looks at whether there is a positive association (as the control variable increases, so does the response variable) or a negative association (as the control variable increases, the response variable decreases).

When iNZight gives the equation of this trend line it also produces another value it calls correlation. The correct name for this value is in fact the correlation coefficient and it is often assigned the letter r.
The correlation coefficient can range in value from -1 (a perfect negative association) through to 1 (a perfect positive association). This number allows the assignment of a description of the strength of the relationship.
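As a rough illustration of what the software is doing behind the scenes, the sketch below fits a least-squares linear trend line and computes the correlation coefficient by hand. The function names and the BMI and body fat values are invented for illustration only; they are not taken from the Australian Institute of Sport data, and this is not a description of how iNZight itself is implemented.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def linear_trend(xs, ys):
    """Least-squares line y = intercept + slope * x; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Choosing the intercept this way forces the line through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


def correlation(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented example values (control variable: BMI, response: % body fat).
bmi = [20.0, 22.0, 24.0, 26.0, 28.0]
body_fat = [8.0, 11.0, 12.0, 16.0, 18.0]

intercept, slope = linear_trend(bmi, body_fat)
r = correlation(bmi, body_fat)
print(f"body fat = {intercept:.2f} + {slope:.2f} * BMI, r = {r:.3f}")
```

A positive slope together with a positive r matches a positive association, and substituting the mean BMI into the fitted equation returns the mean body fat, which is why the point of means sits on a least-squares line.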
As a general rule of thumb, these descriptions of the relationship present and values of the correlation coefficient are acceptable:

Correlation coefficient    Relationship
0.75 to 1                  Strong
0.5 to 0.75                Moderate
0.25 to 0.5                Weak
0 to 0.25                  None
-0.25 to 0                 None
-0.5 to -0.25              Weak
-0.75 to -0.5              Moderate
-1 to -0.75                Strong

Outliers are a big source of variation and need to be looked into carefully. There must be good reason to remove a value from a data set as the process can dramatically alter the relationship. Also, two or three values would be an absolute maximum to remove, and usually the removal of one outlier is sufficient to see a change.

There are two distinct types of outliers: ones that do not fit the pattern of the rest of the data (the left hand graph below has one circled) and ones that fit the pattern but are a long way from the main data set (the right hand graph has one of these).
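To see how much a single outlier can change the relationship, the sketch below computes the correlation coefficient with and without one point that does not fit the pattern, and describes each result using the rule-of-thumb bands above. The data are invented for illustration, the function names are hypothetical, and assigning a boundary value such as 0.75 to the stronger band is an assumption, since the bands do not say which side a boundary belongs to.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe_strength(r):
    """Rule-of-thumb description; boundary values go to the stronger band (assumption)."""
    size = abs(r)
    if size >= 0.75:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.25:
        return "weak"
    return "none"


# Invented data: the last point does not fit the pattern of the rest.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 1]

r_all = correlation(xs, ys)
r_trimmed = correlation(xs[:-1], ys[:-1])  # same data with the outlier removed

print(f"with outlier:    r = {r_all:.3f} ({describe_strength(r_all)})")
print(f"without outlier: r = {r_trimmed:.3f} ({describe_strength(r_trimmed)})")
```

A change this large from removing one point is exactly why the removal needs a good reason grounded in the context, not just an improved value of the correlation coefficient.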
Outliers

When trying to identify potential outliers of the first type, residuals help a lot. A residual is the vertical distance from a raw data value to the value predicted by the trend line. These need to be calculated and graphed to back up the selection of type 1 outliers.

In some cases, when graphing the residuals, a pattern will emerge. This suggests that perhaps a linear model was not the best choice. A visual check of the linear trend line on the raw data will confirm this. The programme iNZight only has the option of trying a quadratic or a cubic as curved models. Other software allows the user to look at logarithmic, power and exponential models also.

Sometimes when plotting bivariate data, groupings become apparent in the data. These groupings can usually be explained by looking at a third variable. This third variable is commonly a categorical variable, hence it has the ability to separate the data into groups.

Conclusion

Predictions form part of the conclusion as they are used to help answer the investigative question this report started with. There are two different types of predictions that should be looked into: interpolations and extrapolations. An interpolation is a prediction within the range of x-values present in the sample, and an extrapolation is a prediction outside that range, above or below.
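The residual and prediction ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data values are invented, the function names are hypothetical, and residuals are taken as signed values (observed minus predicted), which is one common convention.

```python
def linear_trend(xs, ys):
    """Least-squares line y = intercept + slope * x; returns (intercept, slope)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys, intercept, slope):
    """Signed residuals: observed value minus the value predicted by the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def predict(x_new, xs, intercept, slope):
    """Predicted response, labelled by whether x_new lies inside the sampled x-range."""
    inside = min(xs) <= x_new <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return intercept + slope * x_new, kind


# Invented example values (control variable: BMI, response: % body fat).
bmi = [20.0, 22.0, 24.0, 26.0, 28.0]
body_fat = [8.0, 11.0, 12.0, 16.0, 18.0]

intercept, slope = linear_trend(bmi, body_fat)
res = residuals(bmi, body_fat, intercept, slope)

for x_new in (25.0, 35.0):
    value, kind = predict(x_new, bmi, intercept, slope)
    print(f"BMI {x_new}: predicted body fat {value:.2f} ({kind})")
```

Plotting `res` against `bmi` gives the residual plot: a point with an unusually large residual is a candidate type 1 outlier, and a curved pattern in the residuals suggests the linear model is a poor choice. Predictions labelled as extrapolations deserve extra caution in the conclusion, since the trend has not been observed outside the sampled range.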
An appropriate evaluation of these predictions is required and leads on to the answer to the investigative question. When summing up in the conclusion, great care must be taken when making causal relationship statements. Careful analysis of potential underlying variables must have been done in order to improve the strength of the argument for or against such a claim. Have other variables that could potentially influence the response variable been considered, rather than just looking for a straight predictive relationship?

To be continued