Week 13, 11/12/12-11/16/12, Notes: Quantitative Summaries, both Numerical and Graphical.

1 Monday's, 11/12/12, notes: Numerical Summaries of Quantitative Variables

Chapter 3 of your textbook deals with statistics, parameters, and numerical summaries.

Sample statistics are numerical measures of location, dispersion, shape, association, etc. that are computed for data FROM A SAMPLE. Population parameters are numerical measures of location, dispersion, shape, association, etc. that are computed for data FROM A POPULATION.

Note: most of the time, we will just say statistic or parameter. Keep in mind that statistics are always from the sample and parameters are always from the population. In most cases, parameters are denoted by Greek letters, and statistics are denoted by their English alphabet counterparts. Additionally, statistics are sometimes referred to as point estimates of the parameters that they represent. This concept is especially prevalent in hypothesis testing and confidence interval construction.

Mean is the average value or expected value. The population mean is represented by mu, $\mu$. If necessary, you can add a subscript to avoid confusion, like $\mu_x$ vs. $\mu_y$. The sample mean is represented by x-bar, $\bar{x}$. Computation of $\bar{x}$ for a sample $x_1, \ldots, x_n$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The population variance is denoted as $\sigma^2$, while the sample variance is denoted by $s^2$. For a population of size $N$ and a sample of size $n$, they are computed as such:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Mode is the value that occurs the most (has the highest frequency).

Range = largest value (maximum) - smallest value (minimum).

Percentile is best represented with an example. The pth percentile is a value such that at least p% of the data set (or distribution) is less than or equal to this value.

There are 3 special percentiles, called the quartiles. The quartiles split the data into 4 parts. The lower quartile, median (aka the 2nd quartile), and the upper quartile are the 25th, 50th, and 75th percentiles respectively. The lower and upper quartiles are sometimes known as the first and third quartiles.

We typically abbreviate these 3 values as Q1, M, and Q3.

Percentiles (and quartiles) are calculated using the indexing method (see page 86 of your text for a reference).

Interquartile Range, or IQR, is Q3 - Q1.

A boxplot is a visual representation of the 5 number summary. The 5 number summary is the minimum, Q1, the median, Q3, and the maximum. There are different types of boxplots, namely a regular boxplot and a modified boxplot. The modified boxplot will highlight any outliers, but a regular one will not. Your teacher will demonstrate both of these versions. Please keep in mind that there are different variations of a modified boxplot.

An outlier is a data point that does not fit with the rest of the data. In a univariate case, this number can be either too small or too large. In a bivariate case, it would be a data point that does not fit the overall trend of the variables taken together. We use an outlier test to identify such points.

Example 13.1 Hank Aaron hit an astounding 755 home runs in his career. His career began in 1954 and spanned 23 seasons, in which he hit 13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32, 44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10. What is the mode of the data set? What is the range of the data set? Create both a regular and a modified boxplot for the number of home runs that Hank Aaron hit in a season. Find the 61st percentile.

Example 13.2 A Stat 113K class was asked how many times they wanted to eat ice cream last summer. The answers given were: 0, 15, 18, 7, 15, 28, 10, 20, 3, 10, 6, 10, 8, and 9. What is the mode of the data set? What is the range of the data set? Create both a regular and a modified boxplot for the number of times the students wanted to eat ice cream. Find the 18th percentile.

Example 13.3 Suppose we have the data set 1, 2, 3, 4, and 5. Find the mean of the data. Also compute the variance in 2 ways (one assuming that this is a sample, the other assuming that this represents the entirety of the population). For these 2 different variance calculations, how would you denote the mean?

Example 13.4 Suppose we have the data set -4, -2, 0, 2, and 4. Find the mean of the data. Also compute the variance in 2 ways (one assuming that this is a sample, the other assuming that this represents the entirety of the population). How does the variance relate to that in Example 13.3? Is this surprising, or can you show why this is true?
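As a minimal sketch, the numerical summaries from Monday's notes can be computed in Python. The function names are illustrative and do not come from the textbook. Two rules are assumptions, because the notes refer to them without stating them: the indexing method is taken to be i = (p/100)n, using the next position up when i is not a whole number and averaging positions i and i+1 when it is; the outlier test is taken to be the common 1.5 * IQR fence rule. Check both against page 86 of your text and your instructor's definition.

```python
from collections import Counter


def mean(data):
    return sum(data) / len(data)


def variance(data, sample=True):
    """Sample variance divides by n - 1; population variance divides by N."""
    m = mean(data)
    sum_sq = sum((x - m) ** 2 for x in data)
    return sum_sq / (len(data) - 1 if sample else len(data))


def modes(data):
    """Return every value that attains the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def percentile(data, p):
    """Index-method percentile for a whole-number p with 0 < p < 100.

    ASSUMED rule: i = (p / 100) * n. If i is not a whole number, take the
    value in the next position up; otherwise average positions i and i + 1.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)
    q, r = divmod(p * len(xs), 100)  # integer arithmetic avoids rounding error
    if r:                            # i is not whole: 1-based position q + 1
        return xs[q]
    return (xs[q - 1] + xs[q]) / 2   # i is whole: average positions q and q + 1


def five_number_summary(data):
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


def outliers(data):
    """ASSUMED test: flag points beyond 1.5 * IQR from the quartiles."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    return sorted(x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr)


# Data from Examples 13.1 and 13.2.
aaron = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
         44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]
ice_cream = [0, 15, 18, 7, 15, 28, 10, 20, 3, 10, 6, 10, 8, 9]

if __name__ == "__main__":
    print(modes(aaron), max(aaron) - min(aaron))
    print(five_number_summary(aaron), percentile(aaron, 61))
    print(outliers(ice_cream))
    for data in ([1, 2, 3, 4, 5], [-4, -2, 0, 2, 4]):
        print(mean(data), variance(data), variance(data, sample=False))
```

Under these assumed rules, the two data sets in Examples 13.3 and 13.4 give sample variances of 2.5 and 10.0 and population variances of 2.0 and 8.0. The factor of 4 reflects that the second data set is the first one shifted by 3 and then doubled: shifting leaves the variance unchanged, and doubling multiplies it by 2 squared.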

2 Wednesday's, 11/14/12, notes: Applications, Types of Data, and Summarizing Data

Statistics is the science of collecting, analyzing, presenting, and interpreting data.

Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation. A data set is all the data collected in a particular study. Elements are the individual entities of a data set. A variable is a characteristic of interest for the elements. An observation is the set of measurements obtained for a particular element.

There are two main types of variables, qualitative (aka categorical) and quantitative (aka numerical).

Qualitative data have labels or names used to identify an attribute of an element. Qualitative data use either the nominal or ordinal scale of measurement. On the nominal scale, order does not matter. On the ordinal scale, order does matter: the order or rank of the data is meaningful.

Quantitative data have numeric values that indicate how much or how many of something. Quantitative data use either the interval or ratio scale. On the interval scale, differences between values are meaningful, but ratios of quantities cannot be meaningfully compared. On the ratio scale, ratios of quantities are meaningful.

Note: We can use numeric values to represent categorical data. This is often done when working with a data set. For example, suppose we are interested in the grade level of a student. Instead of using the values Freshman, Sophomore, Junior, and Senior, we could use the values 1, 2, 3, and 4. Since the numbers represent categories, grade level is still a qualitative variable.

When referring to a variable, we can describe it as qualitative or quantitative, and as one of nominal, ordinal, interval, or ratio.

Cross-sectional data is data collected at the same or approximately the same point in time. Time series data is data collected over several time periods.

Example 13.5, Wabash College student data set

    Gender  Grade      Hometown      Major       Pieces of Candy Consumed
    Male    Sophomore  Indianapolis  Psychology  15
    Male    Senior     Crown Point   Spanish     12
    Male    Senior     Lombard       Religion    8
    Male    Freshman   Indianapolis  Philosophy  10

What is the entire spreadsheet of data called? Each student is what? How many elements are in the data set? How many variables are in the data set? List the 3rd observation. What type of variable is each variable in the data set (be sure to answer both qualitative or quantitative as well as nominal, ordinal, interval, or ratio)?

Example 13.6, due to Allison For this example, answer what type of variable each of the following is (be sure to answer both qualitative or quantitative as well as nominal, ordinal, interval, or ratio): smoking status, SAT score, income, level of satisfaction, GPA, clothing size (s, m, l, xl), and time taken to run a mile.

Example 13.7 For this problem, state whether the variables included are cross-sectional or time series.
Current GPAs of Purdue Statistics graduate students vs. GPA of Sanvesh during his time at Purdue.
Value of Gordon Gekko's portfolio over the previous 3 years vs. value of all portfolios at Charles Schwab in one particular January.
Total salary of the LA Lakers throughout the 1990s vs. salaries of all NBA teams in one particular year.

Where does data come from? Sources of data can be existing sources (employee records, student records, medical history, etc.), surveys (teacher evaluations, Amazon buyer reports), experiments, or observational studies.

Population is the set of all elements of interest in a particular study. Sample is a subset of the population.

Census is a survey designed to collect data from the entire population.

Statistical inference is the process of using data obtained from a sample to make estimates or test hypotheses about the characteristics of a population. Some of the reasons that people use samples as opposed to looking at the whole population are time, money, etc.

Types of Sampling

Simple random sampling, abbreviated SRS, selects a sample such that each possible sample of size n has the same probability of being selected. One consequence is that each element in the population has an equal chance of being picked to be in the sample.

Sampling with replacement is sampling where the elements are put back in the population after being selected for the sample. This allows an element to be selected more than once for a single sample.

Sampling without replacement is sampling where the elements are not put back in the population after being selected for the sample. This means an element can be selected at most once for a single sample.

Stratified random sampling is a probability sampling method in which the population is first divided into strata (groups) and a simple random sample is then taken from each stratum.

Probability sampling is sampling where elements are selected from a population with a known probability of being included in the sample. It could give equal probability to each element (this is the SRS), or to elements in a group (stratified sampling), or use any legitimate probability model for the inclusion of each element.

Cluster sampling is sampling where the elements in the population are first divided into separate groups called clusters and then a simple random sample of the clusters is taken. This means that all elements in a selected cluster are part of the sample.

Systematic sampling is a probability sampling method in which one of the first k elements is randomly selected and then every kth element thereafter is picked.

Convenience sampling is a nonprobability method of sampling whereby elements are selected for the sample on the basis of convenience.

Judgment sampling is a nonprobability method of sampling whereby elements are selected for the sample based on the judgment of the person doing the study.
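The probability sampling methods above can be sketched with Python's standard `random` module. This is only an illustration under stated assumptions: the function names, the population of 20 numbered elements, and the grouping into strata and clusters are all hypothetical and are not part of the notes.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS without replacement: every subset of size n is equally likely."""
    return rng.sample(population, n)


def sample_with_replacement(population, n, rng):
    """Each draw is put back, so an element may appear more than once."""
    return rng.choices(population, k=n)


def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from every stratum.

    `strata` maps a stratum label to the list of its elements.
    """
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random; keep every element of a chosen cluster."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [x for label in chosen for x in clusters[label]]


def systematic_sample(population, k, rng):
    """Random start among the first k elements, then every kth element."""
    start = rng.randrange(k)
    return population[start::k]


# Hypothetical population: elements numbered 1 through 20.
population = list(range(1, 21))
groups = {"A": population[:10], "B": population[10:]}   # used as strata
blocks = {i: population[5 * i: 5 * i + 5] for i in range(4)}  # used as clusters

if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed so the run is repeatable
    print(simple_random_sample(population, 5, rng))
    print(sample_with_replacement(population, 5, rng))
    print(stratified_sample(groups, 2, rng))
    print(cluster_sample(blocks, 2, rng))
    print(systematic_sample(population, 4, rng))
```

Note the difference between the stratified and cluster versions: the stratified sample takes a few elements from every group, while the cluster sample takes every element from a few groups.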

Example 13.8, due to Ellen Gundlach I am going to write this in terms of lines.

Elegant, extravagant elephants entertain every evening at seven. They serve escargot and eggs benedict and endive. Eight elderly elegant elephants elevate themselves to the expensive entrance with elevators exceeding expectations. Eating everything edible, elephants expand exponentially. "Excellent!" the entertained elephants express after the entertaining entrees were served. Everything was expedited by the energetic efforts of the executive elephant empress. Everyone was entertained to excess and enjoyed the edible endeavors immensely. The evening ended enchantedly with Echinacea herbal tea.

This example will be led by your instructor.

Summarizing Data Information

Bias is an important concept in statistics. It can refer to the design of a study, the way a question is asked, or the value of a statistic. A design is said to be biased if it systematically favors certain outcomes. This can apply to how a question is asked too. Bias can also be defined as consistent, repeated deviation of the sample statistic from the population parameter in the SAME direction when we take many samples. This means that the statistic tends to fall either consistently below the parameter or consistently above it.

When creating a survey, you want to pay particular attention to avoiding bias. Some things to avoid are confusing wording, asking a question no one would remember the answer to, leading the question to a certain answer, and asking embarrassing (or very personal) questions.

How to summarize qualitative data: You can use a frequency distribution, percent or relative frequency, bar or column graphs, and pie charts.

Frequency Distribution is a summary of data showing the number (frequency) of data values in each of several nonoverlapping classes.

Relative Frequency Distribution is a summary of data showing the fraction or proportion of data values in each of several nonoverlapping classes.
Percent Frequency Distribution is a summary of data showing the percentage of data values in each of several nonoverlapping classes.

Typically the above 3 distributions are summarized in table form. The relative frequency distribution is akin to a pmf. The above 3 distributions can also be represented by a bar graph or pie chart.

Bar graph is a graphical device used for depicting qualitative data that have been summarized by any of the above 3 distributions.

Pie chart is a graphical device used for presenting data summaries based on a subdivision of a circle into sectors that correspond to the relative frequency for each class.

How to summarize quantitative data: You can use dot plots, relative or % frequency, histograms, cumulative distributions, or stem-and-leaf plots.

Dot Plot is a graphical device that summarizes data by the number of dots above each data value on the horizontal axis.

Histogram is a graphical presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of a quantitative variable. It is constructed by placing the class intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.

When making a histogram, you need to pick an adequate number of classes (or, equivalently, an appropriate width of the interval for each class). You do not want so few classes that you lose most of the information, nor so many classes that most of the frequencies are low.

It should be noted that while bar graphs look similar to histograms, they are quite different. Their similarities are that they are constructed using bars and the y-axis is one of frequency, percent frequency, or relative frequency. Their main difference is that a bar graph summarizes a qualitative variable and a histogram summarizes a quantitative variable. Additionally, the bars in a histogram touch, but the bars in a bar graph do not touch. The reason for this last difference lies in how histograms are used: you want to get an idea of the distribution of your variable. We can look at a histogram in much the same way as a pdf. A histogram is often used to see whether a named distribution (like a Normal or Exponential) fits the variable of interest.

Cumulative Frequency Distribution is a summary of quantitative data showing the number of data values that are less than or equal to the upper class limit of each class.
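The frequency, relative frequency, percent frequency, and cumulative frequency distributions defined above can be sketched in Python as follows. The function names are illustrative. The qualitative example uses the Grade column of Example 13.5 and the quantitative example uses the home run data of Example 13.1; the choice of 4 classes of width 10 starting at 10 is an assumption made for illustration, not something the notes prescribe.

```python
from collections import Counter
from itertools import accumulate


def qualitative_distributions(values):
    """Map each class to (frequency, relative frequency, percent frequency)."""
    n = len(values)
    counts = Counter(values)
    return {label: (f, f / n, 100 * f / n)
            for label, f in sorted(counts.items())}


def class_frequencies(data, lower, width, n_classes):
    """Count the data values in equal-width, nonoverlapping classes.

    Class i covers lower + i * width up to (not including) lower + (i + 1) * width.
    """
    freqs = [0] * n_classes
    for x in data:
        idx = int((x - lower) // width)
        if not 0 <= idx < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        freqs[idx] += 1
    return freqs


def cumulative(freqs):
    """Running totals: the number of values at or below each upper class limit."""
    return list(accumulate(freqs))


# Grade column from Example 13.5 (qualitative).
grades = ["Sophomore", "Senior", "Senior", "Freshman"]

# Home runs per season from Example 13.1 (quantitative).
aaron = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
         44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]

# ASSUMED classes: 10-19, 20-29, 30-39, 40-49.
freqs = class_frequencies(aaron, lower=10, width=10, n_classes=4)
cum_freqs = cumulative(freqs)

if __name__ == "__main__":
    for label, (f, rel, pct) in qualitative_distributions(grades).items():
        print(f"{label:<10} {f:>2} {rel:>5.2f} {pct:>5.1f}%")
    # A text histogram: one star per data value in the class.
    for i, (f, c) in enumerate(zip(freqs, cum_freqs)):
        low = 10 + 10 * i
        print(f"{low}-{low + 9}  {'*' * f:<8}  freq={f}  cumulative={c}")
```

Dividing each cumulative frequency by the number of data values (23 here) gives the corresponding cumulative proportion.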
If you had a data set of n values, we could think of the cumulative frequency distribution as being n*F(x), where F(x) is the cdf as defined previously.

Cumulative Relative Frequency Distribution is a summary of quantitative data showing the fraction or proportion of data values that are less than or equal to the upper class limit of each class. This is equivalent to the cdf. However, the definition might seem a little strange, as it has been adapted to fit the concept of a histogram (using class limits as opposed to the data values). This definition is used in the case where you do not know the data, just a summary of the data.

Cumulative Percent Frequency Distribution is a summary of quantitative data showing the percentage of data values that are less than or equal to the upper class limit of each class.

Ogive is a graph of a cumulative distribution.

There is another type of graph not mentioned above, the line graph. A line graph is used to summarize time series data. A typical line graph has time on the x-axis and the variable on the y-axis.

3 Friday's, 11/16/12, notes: Graphical Summaries of Variables plus Probabilities with a Crosstab

Stem-and-leaf plot is a technique that orders quantitative data points and provides insight about the shape of the distribution. To make a stem-and-leaf plot, the last digit of the number is the leaf and the rest of the number is the stem. Additionally, any stem that is not used, but is within the range of the data, is kept in the plot. You can also create split-stem plots or trimmed data stem-and-leaf plots.

Example 13.9 Suppose our data set is the numbers 1, 3, 5, 7, 12, 15, 17, 19, 21, 21, 21, 30, 33, 39, and 56. Create a stem-and-leaf plot of the data.

Scatter Diagram or scatterplot is a graphical representation of the relationship between 2 quantitative variables. This topic will be addressed on November 30th.

Crosstabulation (sometimes known as a contingency table) is a summary of data for 2 qualitative variables. The classes for one variable are the rows and the classes for the other variable are the columns. The entries of the table are frequencies.

When we look at crosstabulations, we examine 3 types of probabilities: joint, marginal, and conditional. Joint distribution is how the 2 variables are distributed together. Marginal distribution is how 1 variable is distributed without accounting for the other variable. Conditional distribution is how 1 variable is distributed given a particular value of the other variable. Calculations of these probabilities involve cell totals, row or column totals, and the overall total.

Example Suppose we polled 100 students, 50 of whom went to class yesterday and 50 of whom did not attend class yesterday. We asked them whether or not they were happy. Suppose that 2 of the students who went to class were happy, while 40 of the students who did not go to class were happy. First, create a crosstabulation for this situation.
For each of the following, state whether it is a joint, marginal, or conditional probability, and calculate the probability: that a student is happy; that the student was in class yesterday; that the student was not in class and not happy; that the student was happy, knowing they were in class; and that the student was in class, knowing that they were happy.
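A minimal Python sketch of this example follows, using the counts stated above (50 students in each attendance group, with 2 and 40 happy students respectively). The dictionary layout and function names are illustrative choices. A joint probability uses a cell total over the overall total, a marginal probability uses a row or column total over the overall total, and a conditional probability uses a cell total over a row or column total.

```python
# Crosstabulation: rows are attendance, columns are happiness.
crosstab = {
    ("in class", "happy"): 2,
    ("in class", "not happy"): 48,
    ("not in class", "happy"): 40,
    ("not in class", "not happy"): 10,
}
total = sum(crosstab.values())  # overall total: 100 students


def row_total(row):
    return sum(count for (r, _), count in crosstab.items() if r == row)


def col_total(col):
    return sum(count for (_, c), count in crosstab.items() if c == col)


def joint(row, col):
    """P(row and col): cell total over the overall total."""
    return crosstab[(row, col)] / total


def marginal_row(row):
    """P(row): row total over the overall total."""
    return row_total(row) / total


def marginal_col(col):
    """P(col): column total over the overall total."""
    return col_total(col) / total


def col_given_row(col, row):
    """P(col | row): cell total over the row total."""
    return crosstab[(row, col)] / row_total(row)


def row_given_col(row, col):
    """P(row | col): cell total over the column total."""
    return crosstab[(row, col)] / col_total(col)


if __name__ == "__main__":
    print("P(happy)                      =", marginal_col("happy"))
    print("P(in class)                   =", marginal_row("in class"))
    print("P(not in class and not happy) =", joint("not in class", "not happy"))
    print("P(happy | in class)           =", col_given_row("happy", "in class"))
    print("P(in class | happy)           =", row_given_col("in class", "happy"))
```

The last two results differ (2/50 versus 2/42) because the conditioning group changes the denominator, even though the cell total in the numerator is the same.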

Example Let us examine the following crosstabulation:

                        Men    Women    Total
    Married
    Divorced/Widowed
    Never Married
    Total

What percent of men are married? What percent of people in the sample are divorced/widowed? If we pick a random person who was never married, what is the probability that they are male? What is the probability that a person is married and male? Knowing the person is female, what is the probability they are divorced/widowed? Are these joint, marginal, or conditional probabilities?