DV

Assignment

Chapter 6 - Exploring data with simple charts


1. Mean, median, and mode

Here is an excerpt from Chapter 2 (The well-chosen average) of the book “How to Lie with Statistics” by Darrell Huff:

“You, I trust, are not a snob, and I certainly am not in the real-estate business. But let’s say that you are and I am and that you are looking for property to buy along a road not far from the California valley where I live. Having sized you up, I take pains to tell you that the average income in this neighborhood is some $15,000 yearly. Maybe that clinches your interest in living here; anyway, you buy, and that handsome figure sticks in your mind. More than likely, since we have agreed that for the purposes of the moment you are a bit of a snob, you toss it in casually when telling your friends about where you live. A year or so later, we meet again. As a member of some taxpayers’ committee, I am circulating a petition to keep the tax rate down, assessments down, or bus fares down. My plea is that we cannot afford the increase: After all, the average income in this neighborhood is only $3,500 a year. Perhaps you go along with me and my committee in this – you’re not only a snob, you’re stingy too – but you can’t help being surprised to hear about that measly $3,500.” - Darrell Huff

Briefly explain, with a specific imaginary example and numbers, how the author is NOT lying. Please assume that no one has left or joined the neighborhood and that incomes haven’t changed significantly.

2. Data vs evidence

Here is an excerpt from Chapter 7 (The semiattached figure) of the book “How to Lie with Statistics” by Darrell Huff:

An article on driving safety published by This Week magazine, undoubtedly with your best interests at heart, told you what might happen to you if you went “hurtling down the highway at 70 miles an hour, careening from side to side.” You would have, the article said, four times as good a chance of staying alive if the time were seven in the morning than if it were seven at night. The evidence: “Four times more fatalities occur on highways at 7 PM than at 7 AM.”

Why can you NOT draw that conclusion from the evidence?

3. Percentage vs. percentage points

Say that your company’s profits in the first, second, and third years were $100, $103, and $106. Is it correct to say that your profit increased by 100 percent in the third year? Is it also correct to say that your profit increased by three percentage points in the third year?
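The distinction between the two units can be sketched in a few lines of Python. This is only an illustration: the two rates below (4% and 6%) are hypothetical and are not the answer to the question above, and the function names are made up for this sketch.

```python
def percentage_point_change(old_rate, new_rate):
    """Absolute difference between two rates that are already in percent."""
    return new_rate - old_rate


def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


# Hypothetical rates, in percent (not taken from the assignment)
old_rate, new_rate = 4.0, 6.0

points = percentage_point_change(old_rate, new_rate)
relative = percent_change(old_rate, new_rate)

print(f"Change in percentage points: {points}")
print(f"Change in percent: {relative}")
```

The same pair of rates yields two different numbers, so a statement about a change in a rate should always say which unit it uses.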

4. Conveying a truthful/functional message via plots

The following problem is inspired by Chapter 9 (How to statisticulate) of the book “How to Lie with Statistics” by Darrell Huff:

Companies and organizations often want to show their employees that profits are well below expectations, or at least not rising quickly. Employee unions, on the other hand, want an increase in wages and want to portray how nominal wages have increased over the years.

The year is 1948. A steel company’s employees are unhappy and demanding a raise. The employees released a plot like the one below to show how much more the company is profiting than in 1943.

In its defense, the company released a chart like the one below, showing that a vast proportion of its expenses is wages (the red wall) and that both profits and wages have increased more or less proportionately in the last year.

The table below shows the actual data.

Which of the two plots is “truer” and more “functional”? What graphic(s) would you design to tell a truer, more functional story? Design and build your data graphic.

Data: wages-vs-profit.csv

5. The “unweighted” mean

The table below shows average student evaluations of a professor for the various courses he has taught in the past few years. The department chair has calculated this professor’s average (mean) rating to be 4.64 (the sum of all the courses’ ratings divided by the total number of courses). Discuss why the department chair’s calculation is wrong. Suggest the correct approach and calculate the correct answer.

Data: professor-ratings.csv
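As a reminder of how a weighted mean differs from a plain mean of means, here is a minimal Python sketch. It assumes that each course rating comes with the number of students who responded; whether professor-ratings.csv has such a column is an assumption, and the three courses below are invented for illustration.

```python
def unweighted_mean(ratings):
    """Plain mean: every course counts equally, whatever its size."""
    return sum(ratings) / len(ratings)


def weighted_mean(ratings, weights):
    """Mean in which each rating counts in proportion to its weight."""
    total = sum(r * w for r, w in zip(ratings, weights))
    return total / sum(weights)


# Hypothetical courses (not the assignment data)
ratings = [4.9, 4.8, 3.5]   # average rating per course
students = [10, 10, 80]     # assumed number of respondents per course

plain = unweighted_mean(ratings)
weighted = weighted_mean(ratings, students)

print(f"Unweighted mean: {plain:.2f}")
print(f"Weighted mean:   {weighted:.2f}")
```

In this made-up case the large, lower-rated course dominates the weighted mean, while the plain mean treats it the same as the two small courses.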

6. Plotting meaningful histograms (too large bin size)

The histogram below shows the distribution of age in the Pima Diabetes dataset (n=768). It shows that most individuals in the data are 20 to 45 years old, with the distribution skewed toward the younger ages. For example, there are 200+ individuals in the 20-25 age group (the bin size is 5) and fewer than 25 individuals in the 55-60 age group. If you investigate the data, you will find that 22-year-olds are the most frequent within the 20-25 age group and that there are no 20-year-olds. The histogram’s resolution is too low to show such details, i.e., the bin size is too large. Plot a more effective histogram that makes it clear that 22-year-olds occur most frequently in the data and that 20-year-olds do not occur at all.

Data: pima-diabetes.csv
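One way to think about a finer resolution is a bin width of one year, so that each age gets its own bar. The sketch below counts ages with the standard library and prints a rough text histogram; the sample ages are hypothetical, and reading the real ages from pima-diabetes.csv (whose column names are not given here) is left out.

```python
from collections import Counter


def integer_bin_counts(values):
    """Count each integer value, i.e. a histogram with bin width 1."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    # Keep zero-count values so that gaps in the data remain visible
    return {v: counts[v] for v in range(lo, hi + 1)}


# Hypothetical sample of ages (not the Pima data)
ages = [21, 22, 22, 22, 23, 25, 25, 28]

counts = integer_bin_counts(ages)
for age, n in counts.items():
    print(f"{age:>3} | {'#' * n}")
```

If matplotlib is available, passing `bins=range(min(ages), max(ages) + 2)` to `plt.hist` should give the same one-year bins.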

7. Plotting meaningful histograms (too small bin size)

The histogram below shows the distribution of BMI in the Pima Diabetes dataset (n=768). The bins are too narrow to show the overall shape of the BMI distribution. Plot a more appropriate histogram for BMI.

Data: pima-diabetes.csv

8. Jittering a plot

The strip plot below displays the median income for 3000+ U.S. counties. Because of the alpha (the transparency of the points), it is easier to observe the high-density regions. For example, the median incomes of Alabama’s counties cluster around 40K. However, the same transparency makes the outliers hard to see. Redesign this plot so that outliers are easier to observe.

Data: median-household-income.csv
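Since this problem is titled after jittering, here is a minimal sketch of the idea: each point keeps its value but gets a small random offset along the categorical axis, so overlapping points spread out. The function name, the `width` and `seed` parameters, and the sample positions are all assumptions made for illustration; plotting libraries usually offer their own jitter option.

```python
import random


def jitter(positions, width=0.2, seed=0):
    """Return positions shifted by uniform noise in [-width, width]."""
    rng = random.Random(seed)  # fixed seed keeps the plot reproducible
    return [p + rng.uniform(-width, width) for p in positions]


# Hypothetical example: five counties that all belong to category index 3
category_positions = [3, 3, 3, 3, 3]
jittered = jitter(category_positions)

for original, shifted in zip(category_positions, jittered):
    print(original, round(shifted, 3))
```

The jittered positions would replace the categorical coordinate when drawing the points, while the income values stay unchanged.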

9. Using logarithmic scales

According to the World Health Organization, road traffic injuries caused an estimated 1.35 million deaths worldwide in 2016. Source: Wikipedia.

The table below lists selected countries in North America along with their number of traffic fatalities in 2016. Using logarithmic scales where needed, design and build a functional (effective) plot that shows the relationship between the three variables.

Here is something that you can start with (but it only has two variables).

Data: traffic-fatality.csv

10. Unequal bin histogram

A total of 160 players were asked how many hours they practiced each day. The table below shows the data collected. Build an unequal bin histogram for the data. Your histogram should be appropriately labeled, including a correct y-axis label, tick marks, and a legend to decode the area. Also, build a cumulative frequency graph for the same data.

Hours    Frequency
0–1      53
1–3      72
3–5      13
5–10     12
10–24    10
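With unequal bins, bar height is usually the frequency density (frequency divided by bin width), so that bar area represents frequency. The sketch below computes the densities and the running totals for the table above; it only does the arithmetic and leaves the plotting and labeling to you.

```python
# (lower edge, upper edge, frequency) taken from the table above
bins = [(0, 1, 53), (1, 3, 72), (3, 5, 13), (5, 10, 12), (10, 24, 10)]

densities = []    # bar heights: players per hour of bin width
cumulative = []   # (upper edge, running total) for the cumulative graph
running = 0
for lo, hi, freq in bins:
    densities.append(freq / (hi - lo))
    running += freq
    cumulative.append((hi, running))

for (lo, hi, freq), density in zip(bins, densities):
    print(f"{lo}-{hi} h: frequency {freq}, density {density:.2f} per hour")

print("Cumulative frequencies:", cumulative)
```

The final running total equals the 160 players surveyed, which is a quick check that no bin was dropped.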

11. How the mean, median, and mode change

This problem is adapted from the chapter on “Measuring Central Tendency” from the Head First Statistics book.

The generous CEO of Starbuzz Coffee wants to give all his employees a pay raise. He’s not sure whether to give everyone a straight $2,000 raise or whether to increase salaries by 10%. The mean salary is $50,000, the median is $20,000, and the mode is $10,000.

a) What happens to the mean, median, and mode if everyone at Starbuzz is given a $2,000 pay raise?
b) What happens to the mean, median, and mode if everyone at Starbuzz is given a 10% pay raise instead?
c) Which sort of pay raise would you prefer if you were earning the mean wage? What about if you were on the same wage as the mode?
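To experiment with parts a) and b), you could try both raises on a small made-up payroll. The five salaries below are hypothetical; they were chosen only so that the mean, median, and mode match the figures stated in the problem.

```python
from statistics import mean, median, mode


def summarize(salaries):
    """Return (mean, median, mode) of a list of salaries."""
    return mean(salaries), median(salaries), mode(salaries)


# Hypothetical payroll with mean 50,000, median 20,000, mode 10,000
salaries = [10_000, 10_000, 20_000, 30_000, 180_000]

flat_raise = [s + 2_000 for s in salaries]        # everyone gets $2,000
percent_raise = [s * 11 // 10 for s in salaries]  # everyone gets 10%

print("Original:     ", summarize(salaries))
print("$2,000 raise: ", summarize(flat_raise))
print("10% raise:    ", summarize(percent_raise))
```

Comparing the three printed tuples shows how each kind of raise moves the three averages, which you can then generalize beyond this particular payroll.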