STAB22 section 2.1. Figure 1: Scatterplot of price vs. size for Mocha Frappuccino


We're changing dog breed (categorical) to breed size (quantitative). This would enable us to see how life span depends on breed size (if it does), which we could assess by drawing a scatterplot. (A scatterplot is no good if one of the variables is categorical, but if we had several life spans for each breed, we could draw side-by-side boxplots to see how the breeds compare.)

2.3 Both ounces and price are quantitative variables, and so we could draw a scatterplot to see how they are related. (We might expect that bigger sizes cost more, though a Venti (24 ounces) costs less than twice a Tall (12 ounces), even though it's twice the size. I have problems with a company that calls its smallest serving a Tall, but that may just be me.) If you leave the variable Size as categorical, there is no nice way to make a graph. The individuals (cases) here are cups of Mocha Frappuccino.

2.4 The price of a drink depends on the size (rather than vice versa, logically). So price should be the response and size the explanatory variable, and on your scatterplot price should be on the vertical (y) scale. I typed the numbers into Minitab and produced the plot shown in Figure 1, though you could just as easily do this by hand. As the size goes up, the price goes up as well, but not in a straight-line way: the relationship looks less steep as the size increases. This reflects the fact that a 24-ounce drink costs the least per ounce, because the coffee itself is only one component of the price, and there is also the fixed cost of hiring a barista to serve you, however big a drink you have.

Figure 1: Scatterplot of price vs. size for Mocha Frappuccino

2.6 The first test comes before the final exam chronologically, so the final exam score should be the response (and go on the vertical scale on your scatterplot). Again, this one could be done either by hand or by using Minitab (your choice). I used Minitab, with the results shown in Figure 2. Select Scatterplot and Simple, then select the response as Y and the explanatory as X. There is essentially no relationship between the two scores: if you knew the first test score, that would not help you at all in predicting the final exam score. This might be because the first test came very early in the course, and the material it tested was very different from that on the final exam. Or students might react to their first test result: a student who scores poorly might study hard for the final, and a student who scores well might relax a bit too much before the final.

Figure 2: Scatterplot of first test and final exam scores

2.7 Again, the final exam score will be the response. My scatterplot is shown in Figure 3. There appears to be something of a positive association (more so than in Figure 2, anyway), so that knowing the score on the second test helps a bit in predicting the final exam score. (Note that the student who does best on the second test, 175, does well on the final, and the two students who score under 150 on the second test don't do very well on the final either.) By the time the second test comes around, usually late in the semester, it will usually be pretty clear what material is going to be tested (pretty much the same stuff that will be on the final), so a student who does well on one will probably do well on the other (and will know how hard they need to study for the final).

Figure 3: Scatterplot of second test and final exam scores

2.9 Think of whether one variable might be the cause of the other, or whether the two variables are just things that happen to go together. In (b) and (e), the two values in each case are obtained at the same time, and so they just go together (or not): just explore the relationship in each case. In (a), older children will tend to be heavier, so that if you knew the age of a child, you would be able to predict their weight. Being able to say "if I knew x, I would be able to predict y" means that x is explanatory and y is the response: here age is explanatory and weight is the response. In (c), if you knew how many bedrooms the apartment has, you could make a guess at its rental price. Thus bedrooms is explanatory, and rental price is the response. In (d), likewise, if you knew how much sugar a cup of coffee has, you would be able to guess how sweet it would taste. (A more interesting setup would be to have a friend prepare three cups of coffee with differing amounts of sugar in, and then, by tasting, you would rank them in order of sweetness. If you're a big coffee drinker, you would probably get pretty close to the right order.)
In each of (a), (c) and (d) here, you could make a case for the explanatory and response variables being the other way around, but the major interest would be in the relationships as described above. For instance, if you knew the weight of a child, you could guess their age, but you would normally want to do it the other way around.

Parents' income is explanatory and college debt is the response, because parental income influences college debt (it comes first). These variables are both quantitative (you would measure them). If the parents have a high income, the student will not have to borrow so much money, so the debt will be low; if the parents have a low income, the student will have to borrow a lot of money to pay tuition, living expenses and so on. So we would expect a negative association. This is assuming that parents will pay their children's college expenses, if they can. This isn't always the case. Some students work while they're at school (or during the summers) and save what they earn, and such students can be expected to graduate with a lower debt than they would otherwise have had.

IQ is supposed to be a measure of general intelligence, and we would expect more intelligent children to be more interested in and more skilled in reading. This would be especially true for children in the same grade (and thus of about the same age). In Figure 2.6, children with higher IQ scores generally have higher reading scores, though there is a lot of scatter. There are four children (with IQs between 100 and 130, and reading scores less than 20) that don't seem to follow the general trend. Their reading scores are about 40 points less than you would expect based on their IQ; these children could have some kind of developmental problems that hinder their reading even though they score well on general intelligence. Ignoring the outliers, the trend is roughly linear (there is no obvious curve to the relationship, which is how you tell). But it isn't very strong: there is a lot of scatter in the picture, which is another way of saying that if you know a child's IQ, you wouldn't be able to predict their reading test score very accurately. (There is more to reading than general intelligence, in other words.)

2.13 As on a normal probability plot, when you see a stair-step pattern like this, it means that one of the variables only takes a few different values. Here, it's the child's self-estimate of reading ability, which can only be 1, 2, 3, 4 or 5. There are 60 children, so there are several with the same self-estimate. Having said that, children with a high test score also tend to have a high self-estimate (all of the children with test scores above 80 rate themselves 3 or better). Likewise, the children with a test score below 40 rate themselves 3 or worse, with one exception. This exception is the one outlier: a test score of about 10, and a self-estimate of 4, which is a serious over-estimate (looking at the plot, you would expect this child to have a self-rating of 1 or maybe 2).

This is most easily done with Minitab.
You can get the data from Table 1.10 off the disk (with the textbook); you can open the .mtp file in Minitab. First get rid of the Honda Insight. This is the car with the highest gas mileages, and is very different from the other cars. Click on the number 10; this highlights the whole 10th row of the data. Right-click on one of the highlighted cells, and select Delete Cells. (Or hit the Delete key while you have the whole row highlighted.) The Honda Insight disappears.

Here's how to get the plot you need. First notice that the worksheet in Minitab has one table and three columns (unlike Table 1.10 in the text): the first column is an M or a T corresponding to the type of car. We're going to use this column to help us make the plot. Select Graph and Scatterplot. Select the second option, With Groups (in version 14). In the dialogue that appears, put the cursor under Y variables and select Hwy (the response), then put the cursor under X variables and select City (explanatory). Then make the groups: under Categorical Variables for Grouping, click on the box and select Type (which appears in the list on the left). When you've done all that, click OK. I got the plot shown in Figure 4.

Figure 4: Scatter plot of gas mileages

The plot shows black circles for minicompact cars, and red squares for two-seater cars, as shown in the legend on the right. There is a clear positive association; cars with good city gas mileage have clearly better highway gas mileage also. The plot is roughly linear (ask yourself whether there is a clear curve in the trend; here there is not). Imagine separating out the reds and the blacks; the relationship appears to be about the same for the two types of cars. The major difference is that there are some two-seater cars with very poor gas mileage (bottom left of plot). You can look back at the data to see which cars these are: the ones with highway gas mileage less than about 16. These are the two Lamborghini models and the two Ferrari models. If you own a Lamborghini or a Ferrari, gas mileage is not what you're worried about!

I did this in Minitab again (though you could do this one by hand if you really want to). Get the data from the disk into Minitab; treat brain activity as the response. Select Graph and Plot, and select the two variables into Y and X with brain activity as Y. My plot is in Figure 5.

Figure 5: Scatter plot of brain activity against social distress

The relationship shows an upward trend: a higher score on the distress scale leads to a higher brain activity measurement. The relationship is more or less linear and fairly strong. I don't see any outliers. The data do suggest that distress from social exclusion is related to brain activity in the pain region.

Same procedure in Minitab: get the data from the disk into a worksheet, and select Graph and Plot, with the right variable (cycle length, here) as the response, Y, variable. My plot is shown in Figure 6.

Figure 6: Plot of cycle length against day length

The point on the far right (with day length close to 24) is an outlier, because it is not part of the general pattern. You could claim that there is a positive association, but it is very weak: if you try to predict cycle length from day length, your prediction won't be very accurate.

2.19 My plot of team value against revenue is in Figure 7.
There appear to be five outliers: the three teams with revenue less than 80 and value higher than the other teams with the same revenue, and the two teams top right with the highest revenues. (You could argue that the latter two teams just happen to have high revenues but are on the line that marks the general trend.) To find out which teams these are, look back at the data: the Grizzlies, Cavaliers and Rockets have higher values than the revenues suggest, while the Lakers and Knicks have high revenues and values. There is a more or less linear trend with a positive association. The relationship is quite strong.

Figure 7: Plot of team value against revenue

Compare the plot of team value against operating income, Figure 8. There is much less of a trend, so it's harder to talk of outliers, just points that don't fit the overall scatter. The Lakers and Knicks again stand out as the teams with highest value. The team over on the left with negative operating income is the Trailblazers. If you had to predict value, revenue is the better variable to use because the relationship is stronger.

Figure 8: Plot of team value against operating income

2.20 "Perhaps the severity of MA can help predict the severity of HAV" is the clue that MA is explanatory and HAV the response. So put HAV on the vertical scale and MA on the horizontal of your scatterplot. My scatterplot is in Figure 9.

Figure 9: Scatterplot of HAV angle vs. MA angle

There is something of a positive trend here (you might call it a weak-to-moderate trend). The patients with higher MA angle do tend to have a higher HAV angle. There is one outlier: the patient with HAV angle 50.

There is a relationship, but it's not very strong, so MA angle could be used to predict HAV angle. It's just that there is so much scatter that the predictions wouldn't be very good.

This is the same idea as 2.16, and can be done the same way in Minitab. The last sentence of the first paragraph in the text gives you a clue as to what should be on the y-axis: rate is the response, and mass the explanatory variable. So get a scatterplot of Rate against Mass, with groups, and use Sex as the grouping categorical variable. Your plot should look something like Figure 10.

Figure 10: Metabolic rate vs. lean body mass

Looking at all the data, the relationship is positive (larger lean body mass goes with larger metabolic rate), and the trend looks linear. The relationship looks quite strong, except perhaps at the upper end. Separating out the men and women, some of the men (red squares) have large lean body mass and large metabolic rate, and the trend overall for the men is not as clear as it is for the women (black circles). (Most of the larger values are men, and all of the smaller values, on both variables, are women.)

2.22 It's tricky to sort out the roles of the variables here. Normally, fuel consumption would lead to speed, but here we are asking the opposite question: how does fuel consumption change as speed changes? So fuel consumption is the response, and speed the explanatory variable. There's nothing else new about making the plot, as shown in Figure 11.

Figure 11: Fuel used against speed

The relationship goes down and then up, so you can't describe it with a straight line. It's a curve. Because of the way fuel usage is measured here, a low value is good: a lot of gas is used at low speeds and at high speeds, with a best value coming in between, here at about 60 km/h. (The same kind of picture happens for other cars: there is a best speed for fuel consumption which is less than typical highway speeds.)
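
As a rough numerical illustration of why a straight line fails for a down-then-up pattern like this one, the sketch below computes the Pearson correlation, a measure of straight-line association. The fuel figures are made up (not the textbook data): they come from an assumed formula whose minimum sits at 60 km/h, and the function name `pearson_r` is likewise just a label chosen for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: how strongly *linear* the relationship is."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data, NOT the textbook values: speeds in km/h and a fuel
# measure that falls until 60 km/h and then rises again.
speeds = list(range(10, 111, 10))  # 10, 20, ..., 110
fuel = [5 + 0.002 * (s - 60) ** 2 for s in speeds]

r_all = pearson_r(speeds, fuel)          # whole curve: close to zero
r_low = pearson_r(speeds[:6], fuel[:6])  # falling half only: strongly negative

print(f"whole curve:  r = {r_all:.3f}")
print(f"10-60 km/h:   r = {r_low:.3f}")
```

With these invented numbers the fuel value is determined exactly by speed, yet the correlation over the whole range is essentially zero, because the falling and rising halves cancel out. Restricted to the falling half, the same data give a strongly negative value.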
Because the relationship doesn't go consistently either down or up, it doesn't make sense to describe it as either a positive or negative association. The relationship is actually quite strong: if you were to use a curve to describe this relationship, you'd be able to predict fuel usage quite accurately from speed. You just wouldn't be able to describe the relationship by a straight line. (Later, we learn to calculate a number called the correlation, which describes how strongly linear a relationship is; here the correlation would be quite small, because, even though there is a strong relationship, it doesn't look like a straight line.)

2.23 This one could be done by hand (as long as you take care to get the vertical scale sensible). Or you can do it in Minitab. The tricky part is getting the data in the right form; as the data come off the disk, all the years and record times are in one column (each), with the sex of the athletes in a third column. You can copy and paste the women's times into two new columns, in a more or less obvious way: select the values you want to copy by clicking and dragging, move the cursor to the top of a new column, and then paste. Then make a scatterplot of time (y) against year (x). My plot is in Figure 12. The plot shows a big jump before 1970, then a steady rate of improvement until the mid-80s, and a slower rate of improvement since then. (Note that the women's 10,000 metres only became an Olympic event in 1988, so that more attention may have been paid to training since that time. That would explain the lack of large improvement since the mid-80s.)

Figure 12: Women's record times for 10,000 m race

2.25 To get the plot with men's and women's records separately labelled, use the same idea as 2.16 and 2.21: do a scatterplot with groups, and select Sex as the grouping variable.

Figure 13: Men's and women's 10,000 m record times

Men (red squares) have been running this event for longer than women (black circles), so their history is longer. But the women's record appears to have been dropping more quickly than the men's. In recent years, though, the women's record hasn't dropped very much, while the men's has dropped more quickly. So the data support the first claim of (b), but not the second (the men's record is still less than the women's, with no apparent sign that the women are going to catch up).

The 2002 returns are mostly negative, reflecting the fact that the stocks composing the mutual funds mostly fell, and the 2003 returns are mostly positive, for the opposite reason.
I also drew a scatterplot of the 2002 and 2003 returns, as shown in Figure 14. There is one outlier on the right (as mentioned in the text); apart from this, there seems to be a downward trend (negative association), saying that funds that did badly in 2002 (lost a lot of their value) did well in 2003, and funds that did well (less badly) in 2002 did poorly in 2003. I don't know what this says about prospects of success when you invest in mutual funds, but it does suggest a bounce-back effect: funds that do especially badly one year will recover the next. (Mutual funds are designed to be good collections of stocks, to allow small investors to diversify and protect themselves against extreme behaviour of the market.)

Figure 14: Scatterplot of 2002 and 2003 returns