

Chapter 3: Displaying and Describing Categorical Data

Objectives: Students will:
1. Be able to determine the appropriate display for categorical variables.
2. Be able to summarize the distribution of a categorical variable with a frequency table.
3. Know how to construct and analyze a contingency table.

The three rules of data analysis won't be difficult to remember (pg 21):
1. Make a picture. Things may be revealed that are not obvious in the raw data. These will be things to think about.
2. Make a picture. Important features of and patterns in the data will show up. You may also see things that you did not expect.
3. Make a picture. The best way to tell others about your data is with a well-chosen picture.

Frequency tables are often used to organize categorical data. A frequency table displays the category names and the counts of the number of data values in each category (pg 21).

A relative frequency table is similar, but gives the percentages (instead of counts) for each category.

Explain to your partner: if we know the counts, how can we get the percentages for each category? (pg 22)

Example 1: Find the relative frequency and percent for each color of M&M's.

Color    Frequency   Relative frequency   Percent
Blue     13
Red      7
Orange   11
Green    9
Yellow   8
Brown    7
Total                                     %
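One way to check the table by hand is sketched below in Python. It uses only the counts given in Example 1. The variable names are illustrative, and the rule applied is the usual one: relative frequency is the count divided by the total, and percent is the relative frequency times 100.

```python
# Frequency counts from Example 1 (M&M colors).
counts = {"Blue": 13, "Red": 7, "Orange": 11, "Green": 9, "Yellow": 8, "Brown": 7}

total = sum(counts.values())  # total number of candies counted

# Relative frequency = count / total; percent = relative frequency * 100.
relative = {color: n / total for color, n in counts.items()}
percent = {color: 100 * rf for color, rf in relative.items()}

# Print one row per color, then a totals row.
for color in counts:
    print(f"{color:<7}{counts[color]:>4}{relative[color]:>8.3f}{percent[color]:>7.1f}%")
print(f"{'Total':<7}{total:>4}{sum(relative.values()):>8.3f}{sum(percent.values()):>7.1f}%")
```

The totals row is a useful self-check: the relative frequencies should add to 1 and the percents to 100, apart from rounding.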

You might think that a good way to show the Titanic data is with this display (pg 22).

Tell your partner what you think is wrong with this display.

The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride. When we look at each ship, we see the area taken up by the ship, instead of the length of the ship. The ship display violates the area principle: the area occupied by a part of the graph should correspond to the magnitude of the value it represents (pg 22).

Bar Charts (pg 23)
A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison. The height of each bar represents the count for its category. A bar chart stays true to the area principle. Thus, a better display for the ship data is:

A relative frequency bar chart displays the relative proportion of counts for each category (pg 23). A relative frequency bar chart also stays true to the area principle. Replacing counts with percentages in the ship data:

The sum of the relative frequencies is 100%.

Pie Charts (pg 23)
Pie charts are another type of display used to show categorical data. When you are interested in parts of the whole, a pie chart might be your display of choice. A pie chart slices the circle into pieces whose sizes are proportional to the fraction of the whole in each category.

Contingency Tables (pg 24)
A contingency table allows us to look at two categorical variables together. It shows how individuals are distributed along each variable, contingent on the value of the other variable. Example: we can examine the class of ticket and whether a person survived the Titanic:

The margins of the table, both on the right and on the bottom, give totals and the frequency distributions for each of the variables. Each of these frequency distributions is called the marginal distribution of its variable (pg 25).

The marginal distribution of Survival is:

Each cell of the table gives the count for a combination of values of the two variables (pg 25). For example, the second cell in the crew column tells us that 673 crew members died when the Titanic sank.
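As a rough illustration of how the margins are built, the sketch below totals the rows and columns of a two-way table. The slide's own table did not survive transcription, so the counts here are an assumption: they are the widely published Titanic figures, and they agree with the 673 crew deaths quoted above. All variable names are illustrative.

```python
# ASSUMPTION: widely published Titanic counts, not transcribed from the slide.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Right-hand margin: one total per survival status (row totals).
row_totals = {status: sum(row.values()) for status, row in table.items()}

# Bottom margin: one total per ticket class (column totals).
classes = list(table["Alive"])
col_totals = {c: sum(table[status][c] for status in table) for c in classes}

grand_total = sum(row_totals.values())

# Marginal distributions expressed as proportions of everyone on board.
marginal_survival = {s: n / grand_total for s, n in row_totals.items()}
marginal_class = {c: n / grand_total for c, n in col_totals.items()}

print("Marginal distribution of Survival:", row_totals)
print("Marginal distribution of Class:   ", col_totals)
```

Note that each marginal distribution ignores the other variable entirely: the survival margin says nothing about class, and the class margin says nothing about survival.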

Examine the class data about gender and political view (liberal, moderate, conservative).
What percent of the class are girls with liberal political views?
What percent of the liberals are girls?
What percent of the girls are liberals?
What is the marginal distribution of gender?
What is the marginal distribution of political views?

Ways to present categorical data
You've seen data represented in newspapers, magazines, and online. How do you normally see it?
Tables (frequency tables)
Bar charts
Pie charts
Line graphs
Contingency tables
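The first three questions sound alike but divide the same cell count by three different totals. The sketch below makes that explicit. The class data are not given in the slides, so the counts are invented purely for illustration.

```python
# HYPOTHETICAL class counts (invented for illustration; not from the slides).
counts = {
    "Girl": {"Liberal": 12, "Moderate": 8, "Conservative": 5},
    "Boy":  {"Liberal": 6, "Moderate": 10, "Conservative": 9},
}

girls_liberal = counts["Girl"]["Liberal"]                          # one cell of the table
class_total = sum(sum(row.values()) for row in counts.values())    # everyone in the class
liberal_total = sum(row["Liberal"] for row in counts.values())     # column total
girl_total = sum(counts["Girl"].values())                          # row total

# Same numerator, three different denominators.
pct_of_class = 100 * girls_liberal / class_total        # girls with liberal views, out of the class
pct_of_liberals = 100 * girls_liberal / liberal_total   # girls, out of the liberals
pct_of_girls = 100 * girls_liberal / girl_total         # liberals, out of the girls

print(f"Percent of the class who are liberal girls: {pct_of_class:.1f}%")
print(f"Percent of the liberals who are girls:      {pct_of_liberals:.1f}%")
print(f"Percent of the girls who are liberals:      {pct_of_girls:.1f}%")
```

Reading the question for the phrase that follows "of the" is usually enough to pick the right denominator.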

A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.

The following is the conditional distribution of ticket Class, conditional on having survived (pg 26):

The following is the conditional distribution of ticket Class, conditional on having died (pg 26):

The conditional distribution of political preference, conditional on being male:

         Liberal   Moderate   Conservative   Total
Male

The conditional distribution of political preference, conditional on being female:

         Liberal   Moderate   Conservative   Total
Female

What is the conditional relative frequency distribution of gender among conservatives?

The conditional distributions tell us that there is a difference in class for those who survived and those who died (pg 26). This is better shown with pie charts of the two distributions:
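A conditional distribution is computed by keeping only one group and dividing each count by that group's total. The sketch below does this for ticket class among survivors and among those who died. The counts are an assumption: they are the widely published Titanic figures, because the slide's tables were not preserved. The helper name `conditional_distribution` is illustrative.

```python
# ASSUMPTION: widely published Titanic counts by ticket class;
# the slide's own tables were not preserved in the transcription.
survived = {"First": 203, "Second": 118, "Third": 178, "Crew": 212}
died = {"First": 122, "Second": 167, "Third": 528, "Crew": 673}

def conditional_distribution(counts):
    """Return each category's share of the group total, as a percent."""
    group_total = sum(counts.values())
    return {category: 100 * n / group_total for category, n in counts.items()}

class_given_survived = conditional_distribution(survived)
class_given_died = conditional_distribution(died)

# Side-by-side comparison: class, percent among survivors, percent among the dead.
for ticket_class in survived:
    print(f"{ticket_class:<7}"
          f"{class_given_survived[ticket_class]:>6.1f}%"
          f"{class_given_died[ticket_class]:>7.1f}%")
```

Each conditional distribution sums to 100% on its own, which is what makes the two columns directly comparable even though the groups differ in size.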

We see that the distribution of Class for the survivors is different from that of the nonsurvivors. This leads us to believe that Class and Survival are associated, that they are not independent. The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable (pg 28).

If the conditional distributions are the same, we can conclude that the variables are not associated. Therefore, they are independent of one another.

If the conditional distributions differ, we can conclude that the variables are somehow associated. Therefore, they are not independent of one another.

Are gender and political view independent?
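The comparison described above can be sketched as a small check: compute the conditional distribution for each group and see whether they match. Both tables below are invented for illustration, and the tolerance `tol` is an arbitrary descriptive threshold chosen for this sketch, not a formal statistical test.

```python
def conditional_distributions(table):
    """For each group (row), return each category's proportion of the row total."""
    return {
        group: {cat: n / sum(row.values()) for cat, n in row.items()}
        for group, row in table.items()
    }

def looks_independent(table, tol=0.01):
    """True if every group's conditional distribution matches the first within tol."""
    dists = list(conditional_distributions(table).values())
    first = dists[0]
    return all(abs(d[cat] - first[cat]) <= tol for d in dists[1:] for cat in first)

# HYPOTHETICAL counts: both rows split 25% / 50% / 25%.
same_shape = {
    "Male":   {"Liberal": 10, "Moderate": 20, "Conservative": 10},
    "Female": {"Liberal": 15, "Moderate": 30, "Conservative": 15},
}

# HYPOTHETICAL counts: the male row is shifted toward conservative.
different_shape = {
    "Male":   {"Liberal": 5, "Moderate": 20, "Conservative": 15},
    "Female": {"Liberal": 15, "Moderate": 30, "Conservative": 15},
}

print(looks_independent(same_shape))       # rows have identical proportions
print(looks_independent(different_shape))  # rows differ
```

With real data the conditional distributions almost never match exactly, so deciding how much difference is enough to call the variables associated is a judgment this sketch does not settle.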

Segmented Bar Charts (pg 29)
A segmented bar chart displays the same information as a pie chart, but in the form of bars instead of circles. Comparing segmented bar charts is a good way to tell whether two variables are independent of one another.

[Figure: segmented bar chart of Gender vs. Political Preference, with bars for Male and Female on a 0% to 100% scale]

Here is the segmented bar chart for ticket Class by Survival status. See the Step-by-Step example (pg 29).

Pair Share Activity
I think the "area principle" is violated because ____.
I think the graph is wrong because ____.

Player A has a higher batting average (0.320 vs ), so he looks like the better choice.

Player B has a higher batting average against both right- and left-handed pitching, even though his overall average is lower.

Player B hits better against both right- and left-handed pitchers. So no matter the pitcher, B is a better choice. So why is his batting average lower? Because B sees a lot more right-handed pitchers than A, and (at least for these guys) right-handed pitchers are harder to hit. For some reason, A is used mostly against left-handed pitchers, so A has a higher average.
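The reversal described here is easy to reproduce with numbers. The slide's batting table was not preserved, so the hits and at-bats below are invented for illustration. They were chosen only so that Player A's overall average comes out to 0.320, as quoted earlier, while Player B is better against each type of pitcher.

```python
# HYPOTHETICAL (hits, at_bats) by pitcher handedness; invented for illustration.
stats = {
    "A": {"left": (30, 80), "right": (2, 20)},   # A faces mostly left-handed pitchers
    "B": {"left": (8, 20), "right": (20, 80)},   # B faces mostly right-handed pitchers
}

def average(hits, at_bats):
    """Batting average = hits / at-bats."""
    return hits / at_bats

def overall(player):
    """Pool a player's hits and at-bats across both kinds of pitcher."""
    hits = sum(h for h, _ in stats[player].values())
    at_bats = sum(ab for _, ab in stats[player].values())
    return average(hits, at_bats)

for player in stats:
    left = average(*stats[player]["left"])
    right = average(*stats[player]["right"])
    print(f"Player {player}: vs left {left:.3f}, vs right {right:.3f}, overall {overall(player):.3f}")
```

The overall average is a weighted average of the two split averages, weighted by at-bats. Because the two players have very different mixes of opponents, the pooled comparison can point the opposite way from both split comparisons.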

What Can Go Wrong? (pg 31)
Don't violate the area principle. While some people might like the pie chart on the left better, it is harder to compare fractions of the whole, which a well-done pie chart does.

Keep it honest: make sure your display shows what it says it shows. This plot of the percentage of high school students who engage in specified dangerous behaviors has a problem. Can you see it?

What Can Go Wrong? (cont.) (pg 32)
Don't confuse similar-sounding percentages: pay particular attention to the wording of the context.

Don't forget to look at the variables separately too: examine the marginal distributions, since it is important to know how many cases are in each category.

Be sure to use enough individuals! Do not make a report like "We found that of the rats improved their performance with training. The other rat died."

What Can Go Wrong? (cont.) (pg 33)
Don't overstate your case: don't claim something you can't.

Don't use unfair or silly averages: this could lead to Simpson's Paradox, so be careful when you average one variable across different levels of a second variable.

What have we learned? (pg 34)
We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents). We can display the distribution in a bar chart or pie chart. And we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables.

b) What percent of the shoppers with only high school education were smokers?
c) What percent of the smokers had only high school education?
