Skip to content

Avoiding Data Pitfalls, Part 3: Confusing Colors

2015 November 2
by Ben Jones

This is the 3rd in a blog post series called “Avoiding Data Pitfalls” (1st, 2nd) that I’ll conclude with this blog post, as I’m turning the rest of the content into a book that I hope to publish next year.

There’s a pitfall that’s quite easy to fall into when creating dashboards with multiple charts and graphs: using color in ways that confuse people. There are many ways to confuse with color, including the oft-maligned red-green encoding that color blind viewers can’t decipher. That’s just one of many, though, and I’d like to use this blog post to illustrate three additional versions of the “confusing color” pitfall that I see people like myself fall into quite often.

Then, I’ll wrap it up by talking about the design goal that I aspire to achieve any time I create a dashboard with multiple views.

Color Pitfall #1. Using the same color hue for two different variables

The example for the first type of this common pitfall comes from a New York City Marathon dashboard (click the “Participation Trends” option at the top) that was created using Qlik. Qlik is a competitor to the company I work for, Tableau, but let me be clear that the product Qlik makes isn’t to blame for the mistake, here. I see this error made with any and every data dashboard product, including Tableau.

Without further ado, here is the portion of the dashboard that I propose could be improved upon by using a different color scheme:

Fig. 1: A NYC Marathon dashboard that uses the same color hue for different attributes

Fig. 1: A NYC Marathon dashboard that uses the same color hue for different attributes

Following my tenet to provide critique with humility, here are some pluses (things I like) and deltas (things I would change) about this dashboard:

Pluses: I like the use of color in the histogram that shows the clear cut-off points, where finishers cross the line in droves immediately before the turning of the hour, especially the 4th hour of the race. This shows how goal-setting can affect the performance of a population, and it’s fascinating.

Deltas: Notice that the same green hue, though, applies to the increase in number of finishers from Italy, Netherlands, people who finished between 4 and 5 hours, and Switzerland. Likewise, the same yellow is used to encode the increase in the number of finishers from Germany, Mexico, and people who finished between 5 and 6 hours after starting. Similarly, red has multiple meanings. Of course there is no actual relation between these particular groups, though it may seem like there is at first glance.

To avoid this confusion, I propose using entirely different color schemes for the histogram and the treemap (and not repeating any colors within the treemap itself), or, better yet, not putting these two charts next to each other at all, as they tell completely different stories.

Color Pitfall #2. Using the same color saturation for different magnitudes of the same variable

Similarly, I’ve made the mistake of using the same color saturation to effectively create two conflicting color legends for the exact same dashboard. Consider this trivial (population?) map that I created with data about mileage of California roads by county to illustrate the point:

Notice that there are two different sequential color legends on the dashboard that use the exact same turquoise color (R:0,G:102,B:99). In the choropleth, the fully saturated turquoise color maps to a specific county (Los Angeles County) with 21,747 total miles of roads. In the bar chart, the full turquoise color saturation maps to a specific road type (Local roads) with a total of 108,283 miles for the entire state. Just eyeing the dashboard in passing, the viewer may connect Los Angeles County with Local roads, and think they are connected. Or, the reader may look at the wrong color legend (if both are in fact included) and be misled about how many miles of road the county or type actually include.

From a software UI perspective, this error was easy to make because all I had to do was drag the “Miles” data field to the Tableau Color shelf in the map Sheet, and also drag it to the Color shelf in the bar chart Sheet. These two Sheets have totally different aggregation types, but I can easily force the same color encoding if I’m not paying attention.

How to Avoid this 2nd Type of Color Pitfall
Notice that the color encoding on the bar chart is actually redundant. We already know the relative proportions of the miles of different road types by the lengths of their corresponding bars, which is quite effective all by itself. Why also include Miles on the color shelf, especially considering the fact that the color would conflict with the choropleth map, where color is totally necessary?

My colleague Dash Davidson came up with a good solution: remove color from the bars altogether and just leave an outline around them:

Color Pitfall #3. Using too many color encodings on one dashboard

It’s very common to use too many color schemes on a dashboard, especially with big corporate dashboards where the various stakeholders call for everything but the kitchen sink to be added to the view.

Here’s a dashboard I create to illustrate the point – my first dashboard that uses the Sales SuperStore sample dashboard that comes with Tableau Desktop:

In this dashboard we see not just one red-green color encodings but two, and they have different extremes for the exact same measure (Profit). We also see red and green used in the scatterplot encoding, but now they refer to different regions instead of different profit levels. Finally, we have another bar chart that uses no color scheme, but each bar is blue – the same color as the Central region in the scatterplot.

You get the point. This isn’t what we want to create. I think I broke all of the rules in creating this one.

My Design Aspiration: One (and only one) color encoding per dashboard

This goal isn’t always possible, but as much as possible, I try to include one and only one color scheme on every dashboard I create. The reason is that I find it takes me a lot longer to figure out what’s going on in someone else’s dashboard when they’ve used more than one. It’s that simple.

This means I often have to make a tough choice – which is the variable (quantitative or categorical) that will be blessed with the one and only one color encoding on the dashboard? It’ll become the variable that receives the most attention, so it should be the one that is most related to the primary task the user will perform when using the dashboard.

For example, if the dashboard was created for a sales meeting in which the directors of each US sales region talk about what’s working well and what’s not working well in their respective regions, then the “Region” attribute could very well take the honored place of prominence:

I hope this was helpful! Do you have any pet peeves or recommendations when it comes to using colors on data dashboards?

Thanks for reading,

My 3 Basic Tenets of Data Visualization

2015 October 30

I have a few thoughts I’d like to put out there, in the spirit of contributing to the ongoing dialogue within the field of data visualization. I’m still relatively new to this field, having participated as a practitioner, consultant and teacher for about ten years. That decade has left me with more questions than answers, and I find more opportunities than ever to expand my knowledge and skill set, and more people than ever with unique perspectives to learn from.

There are a few things, though, that I feel particularly passionate about. The three concepts I describe below amount to “tenets” that I’d like to humbly propose others in the field consider adopting. I definitely didn’t receive them in the form of a carved stone tablet from on high. It’s just stuff I think is true and important.

1. There are no black and white rules

I don’t believe that we can ever declare that a particular visualization type or design decision either “works” or “doesn’t work“. This binary approach is very tempting, I’ll admit. We get to feel confident that we’re avoiding some huge mistake, and we get to feel better about ourselves when we see someone else breaking that particular rule. I started off in this field with that mindset.

The more I’ve seen and experienced, though, the more I prefer a sliding gray scale of effectiveness over the black-and-white “works” / “doesn’t work” paradigm. It’s true that some choices work better than others, but it’s highly dependent on the objective, audience and context. This paradigm makes it harder to decide what to do and what not to do, but I believe this approach embraces the complexity inherent in the task of communicating with other hearts and minds.

Sometimes the most effective choice in a particular situation might surprise us. Consider two analogous examples: chess and writing. Data visualization is like chess in that both involve a huge number of alternative “moves”. Garry Kasparov decided to sacrifice his queen early in a game against Vladimir Kramnik in 1994. He went on to win that game in decisive fashion. Data visualization is like writing in that both involve communicating complex thoughts and emotions to an audience. Cormac McCarthy decided to eschew virtually all punctuation in his 2009 novel The Road. He won the Pulitzer Prize for that novel. Would I recommend either of those decisions to a novice? No, but I wouldn’t eliminate them from the set of all possible solutions, either. This diagram in Tamara Munzner’s Visualization Analysis & Design illustrates why:


If we start with a larger consideration space, it’s more likely to contain a good solution. On the other hand, labeling certain visualization types as “bad” and eliminating them from the set of possible solutions paints us into a corner. Why do that?

For example, consider the word cloud. Most would argue that it’s not terribly useful. Some have even argued that it’s downright harmful. There’s a good reason for that, and in certain situations it is harmful. It’s difficult to make precise comparisons using this chart type, without a doubt. And using a word cloud to analyze or describe blocks of text, such as a political debate, is often misleading as the words are considered entirely out of context. Fair enough. Let’s banish all word clouds, then, right? Let’s malign any software product that makes it possible to create one, right?

I wouldn’t go so far. Word clouds have a valid use, as well, even if it is rare. What if we had a few brief moments during a presentation to impress upon a large room of people, including some sitting way in the back, that there are only a handful of most commonly used passwords, and they are pretty ridiculous. Wouldn’t a word cloud suffice? Would you choose a bar chart, a treemap or a packed bubble over a word cloud, in this scenario? You decide:

I admit it – I’d likely choose the word cloud in the scenario I described. The passwords jump right off the screen at my audience, even for the folks in the back. It doesn’t matter to me whether they can tell that ‘password’ is used 1.23 times more frequently than ‘123456’. That level of precision isn’t required for the task I need them to carry out. The other chart types all suffer from the fact that only a fraction of the words fit in the view. The audience can’t scan the full list at a glance to get a general sense of what’s contained in it – names, numbers, sports, batman.

If what you take from this example is that Ben thinks word clouds are awesome, you’ve completely missed my meaning. In most instances, word clouds aren’t very good at all, just like sacrificing one’s queen or omitting quotation marks entirely from a novel. But every now and then, they fit the need pretty well. We could probably come up with other scenarios where we would choose one of the other three chart types instead. Choosing a particular chart type depends on many factors. That’s a good thing, and frankly, I love that about data visualization.

2. Critique needs to be given with humility

Since there are so many variables in play, and since we hardly ever know the objective, the audience, or the full context of a particular project, we need to be humble when providing a critique of someone else’s data visualization. All we see is a single snapshot of the visual. Was this created as part of a larger presentation or write-up? Did it also include a verbal component when delivered? What knowledge, skills and attitudes did the intended audience members possess? What are the tasks that needed to be carried out associated with the visualization? What level of precision was necessary to carry out those tasks?

These questions, and many more, really matter. If you’re the kind of person who scoffs at the very mention of word clouds, your critique of my example above would be swift and harsh. And it would largely be misguided.

Open dialogue is necessary in data visualization, and healthy debate should be encouraged. When engaging in dialogue and debate, though, I try to remind myself that I don’t know all the details. Seeking to understand some of these details is an important first step. Then providing a few “pluses” and a few “deltas” almost always works. What works well (pluses), and what ideas do I have to make the visualization work better (deltas)? It’s really not that hard.

3. Freedom to innovate is necessary for growth

Lastly, I enjoy that there are so many creative and talented people in this space who are trying new things. I believe that freedom to innovate is necessary for any field to thrive, and I try to do what I can to make sure such a freedom persists in data visualization.

Making blanket statements about certain visualization types, design choices, tools, or even individuals or groups in this space isn’t helpful, and tends to reduce the overall spirit of freedom to innovate. I don’t have data to back that up, by the way, it’s just the way I feel. You may agree or disagree.

And “innovation” doesn’t just involve creating new chart types. It can also include using existing chart types in new and creative ways. Or applying current techniques to new and interesting data sets. Or combining data visualization with other forms of expression, visual or otherwise. As long as we can have a respectful and considerate dialogue about what works well and what could be done to improve on the innovation, I say bring it on.

Adding the winning ideas to the known solution space is good for us all.

I hope these tenets make sense to you. Let me know if you agree, if you’d change anything about my list, or if you’d add any tenets of your own.


How to Make a Scatterplot with Marginal Histograms in Tableau

2015 September 15
by Ben Jones

First, apologies for the blog post drought! It was that kind of summer. It’s good to be back, though, and I hope you’ve been well.

Scatterplots are my favorite visualization type, hands down. From my very first interactive data graphic about The Great One to the most recent visualization below on major league pitchers, I’ve learned a great deal from these Cartesian classics over the years. In this post I’ll show you how to make them even better than the standard ones in Tableau.

Recently, Shine Pulikathara published a scatterplot of NFL player heights and weights that included two marginal histograms – one for each axis. I tweeted that I liked it, and Lynn Cherny replied that it’s pretty common to see this kind of thing in R:

She’s right, and it turns out that it’s also a common convention with other statistical graphing platforms, like Matlab and Plotly. It’s called a Scatterplot with Marginal Histograms. While Tableau has scatterplots and histograms as standard chart types, it doesn’t automatically combine them for you into a single view. The goods news, though, is that it’s fairly easy to combine them using a dashboard with three sheets. There’s only one small trick to make the charts interact the way you want, which I’ll cover below. If you want to follow along, download 2015pitchingstats.xlsx.

First, here is the finished version, showing pitchers “skill” (Earned Run Average, or ERA) and “luck” (Runs Scored by their team, or RS) so far in the 2015 season:

Now, let’s consider the four easy steps to create a scatterplot with marginal histograms:

Step 1: Create the Three Sheets

This part is fairly straightforward – create a scatterplot and two histograms as three separate sheets in the same workbook. To create the scatterplot, drag ERA to Columns, RS to Rows, W% to Color, Player to Label, and then add two Average reference lines, like this:


Next, to create the first histogram, create a new sheet, click on the Measure (say, ERA), click Show Me in the top right, and then choose Histogram. Do the same in another new sheet with RS, but click the Rotate icon in the top icon bar to flip the RS histogram 90°. Notice that two new data fields appear in the Measures area: “ERA (bin)” and “RS (bin)”. Right click to edit these fields and change the “Size of bins” to be 0.25 and hide the axes.

histogram2 histogram1

Step 2: Add the Histogram Bin Dimensions to the Scatterplot Chart Detail

Without this step, you won’t be able to get the sheets to interact together in the dashboard. Go back to the scatterplot sheet you created in Step 1 and drag both “ERA (bin)” and “RS (bin)” to Detail. You should now see these two fields listed in the Marks card area:


Step 3: Add the Three Sheets to a Dashboard

Next, create a new dashboard and add the three sheets you created in Step 1. Aligning the histograms with the scatterplot is the one messy part of this method. Add blanks to the left and right of the ERA histogram, and above and below the RS histogram. Drag the blanks until the extreme bars of the histogram align with the extreme points of the scatterplot:


Step 4: Create Two Highlight Actions:

The last step is to get the sheets to interact with each other. There are lots of ways they could potentially interact, but here’s what I’d like to see happen:

  • When I hover my mouse cursor over any of the histogram bars, the corresponding circles on the scatterplot highlight
  • When I hover my mouse cursor over any of the scatterplot circles, the corresponding histogram bars highlight

To do this, create two new dashboard actions by clicking Dashboard > Actions > Add Action > Highlight, and fill out the dialog boxes as follows:

highlightaction1 highlightaction2

That’s it! For finishing touches, I added a title, lead-in paragraph, data source and last accessed note, four area annotations to define the four quadrants, and two mark annotations to call out points of interest. I also edited the two Average reference lines to uncheck “Show recalculated line for highlighted or selected data points”. This was strictly a matter of preference, and you may not decide to modify the reference lines in that way.

Other Variations

Here are a couple other variations that don’t involve the binning concept inherent in histograms, and therefore don’t required Step 2 above:

Scatterplot with Marginal Box-and-Whisker-Plots

Scatterplot with Marginal Hash Lines

Thanks for reading! I hope you found this helpful. Let me know if you have any further tips by leaving a comment. Also, I’m curious, which of the three variations – marginal historgrams, box plots, or hash lines – do you prefer?


Nature and Viz

2015 June 2
by Ben Jones

I spend a lot of my free time with my family in nature, and it occurred to me recently that there is something particularly captivating about data visualizations that resemble nature. It makes sense if you think about it. As a species, we’re awed and inspired by majestic landscapes, and we’re drawn in by the intricate patterns we see in the world around us.

It’s the same with data visualizations. We use words like “enlightening,” “stunning,” and “beautiful” to describe the really good ones. Some people tire of the use of these adjectives, but I think there’s a good reason they have found their application here. They describe our inner experience.

To illustrate the point, I chose five visualizations published to the new Tableau Public “Greatest Hits” gallery and found five corresponding images from nature.






Here is the same content posted to Slideshare:

I’m not suggesting we all go out and start making visualizations that look like sand dunes or tadpoles, as appealing they might look. And relatively few visualizations will work in “petri dish” form, though one about rapidly growing companies just might be in that small set. I’m just sharing my observation that I tend to stop and look closely at patterns that are familiar.

And we can take an important cue from nature as well. Many of nature’s patterns are both mesmerizing and incredibly informative, like the tree rings that encode important data about the life of a tree. If you can inspire as well as inform, then why wouldn’t you do both? Nature does.

Thanks for stopping by,

On Visualizing Data Well

2015 May 19
by Ben Jones

onwritingwellWilliam Zinsser, 92, died last week. Zinsser, author of On Writing Well, the classic guide to writing nonfiction, has been an inspiration to writers and aspiring writers since he first published his manual in 1976. Douglas Martin of the New York Times has written an excellent obituary on Zinsser.

1.5 million copies of On Writing Well have been sold to people all around the world who care about “getting the words right”, as Ernest Hemingway put it. I first read On Writing Well before launching this website four years ago, and I referred to it countless times while writing my first book, Communicating Data With Tableau (O’Reilly, 2014). Zinsser’s book taught me to respect sentences and words more than I had before. Or, to put it another way, it made me realize that my writing sucked. It was a harsh realization at the time, but I needed to know that upfront. I’d like to think that my writing sucks a little less thanks to Zinsser.

As I read On Writing Well it struck me that his advice for communicating well with words applies directly to the craft of communicating visually with data. His seven principles in Part I – The Transaction, Simplicity, Clutter, Style, The Audience, Words, and Usage – could be written about visualizing data as well.

Let’s call it On Visualizing Data Well:

1. The Transaction

Zinsser opened his classic book by teaching that “the product that any writer has to sell is not the subject being written about, but who he or she is.” (emphasis mine). The transaction, then, is a personal one – the reader is drawn in by the “enthusiasm of the writer for his field,” and the two most important qualities that result are “humanity and warmth.”

The same is true when someone examines the product created by a data visualizer. The humanity of the visualizer should shine though:

  • What does the person creating the visualization think about the topic?
  • Why does he or she care about this topic?
  • How does he or she feel about it?
  • Do you know something more about them, not just their topic?

Take another look at one of the most talked about visualizations of 2014, Periscopic’s animated U.S. Gun Deaths. Regardless of what you think about the design, form or aesthetic of the final product, you can’t help but feel the emotions of those who created it. Is there any doubt about what they’re trying to say, and more importantly, why they’re saying it? Kim Rees and her team brought their own humanity to the transaction:

2. Simplicity

Zinsser taught that “the secret of good writing is to strip every sentence to its cleanest components,” and that there are a “thousand and one adulterants that weaken the strength of a sentence.” Here’s an example he gives of “the clotted language of everyday American commerce:”

“The airline pilot who announces that he is presently anticipating experiencing considerable precipitation wouldn’t think of saying it may rain.”

It’s easy to laugh at this bloated phrase because we see it all the time, and we even fall prey to it ourselves. The same is true when we communicate with data. The lesson is that we shouldn’t overcomplicate the message.

This lesson is often misunderstood to mean that we should dumb down the message, or only choose simplistic messages in the first place. This interpretation is wrong. Just as a writer sometimes seeks to articulate a profound thought, we sometimes seek to show relationships that are complex. That’s okay, and we shouldn’t shrink from that challenge in the name of simplicity. But if there’s a clear way to show it, then we should show it clearly. In the words of Albert Einstein, “everything should be made as simple as possible, but not simpler.”

3. Clutter

Zinsser wrote “writing improves in direct ratio to the number of things we keep out of it that shouldn’t be there.” He opens his third chapter with a funny example from the annals of U.S. history:

“Consider what President Nixon’s aide John Dean accomplished in just one day of testimony on television during the Watergate hearings. The next day everyone in America was saying ‘at this point in time’ instead of ‘now’.”

His admonition is to “examine every word you put on paper.” When working with his students at Yale, Zinsser would “put brackets around every component in a piece of writing that wasn’t doing useful work.” Sound familiar? Edward Tufte’s notion of chartjunk is the same notion. Designers and artists celebrate the white space in their creations. Obviously we shouldn’t remove every pixel, just the ones that aren’t doing any work. The trick is knowing which is which.

4. Style

Next Zinsser addresses the objection that reducing a writing product to its simplest form leaves no room for style. He concedes that “simplicity carried to an extreme might seem to point to a style little more sophisticated than ‘Dick likes Jane’ and ‘See Spot run’.”

In data viz, the corollary to these preschool sentences is the bar chart. Simple, easy to understand, but no flair. We’re in familiar territory. It’s the never-ending “clarity vs. beauty” debate. But it’s a false dichotomy. Clarity and beauty are not mutually exclusive. Of course we can achieve both. Information visualization design firm Accurat does it all the time. Here’s an example of their work:


European subterranean veins by Accurat

How can we achieve both simplicity and style? Clarity and beauty? Zinsser’s advice for writers applies to us, too. There’s a reason his chapter on style follows the previous two. A singer with loads of personality who sings out of tune won’t sell records. A carpenter who adds bevels and carvings galore to a chair that doesn’t hold your weight won’t stay in business for long. “This is the problem of writers who set out deliberately to garnish their prose.” Zinsser uses the wood-working analogy to show us the way:

“Extending the metaphor of carpentry, it’s first necessary to be able to saw wood neatly and to drive nails. Later you can bevel the edges or add elegant finials, if that’s your taste. But you can never forget that you are practicing a craft that’s based on certain principles.”

To create a data visualization that is both clear and beautiful, we first must get the raw materials and basic proportions right. Only then we can add what Willard Cope Brinton calls “judicious embellishment of charts”. What’s judicious? Fortunately, as Zinsser puts it, “there is no style store”, and you’ll have to answer that for yourself. Your audience will also have an opinion on the matter.

5. The Audience

Speaking of audience, Zinsser addresses this critical element in the fifth chapter of his masterpiece. We often talk about “knowing your audience” in data viz, and user-centered design in product development. It’s a very popular topic. I even give similar platitudes in the first chapter of my own book.

But Zinsser gives what at first seems like shocking advice on this subject. He says:

“You are writing for yourself. Don’t try to visualize the great mass audience. There is no such audience – every reader is a different person.”

Only write for yourself and don’t even consider who will see your work? Really? He clarifies by differentiating between a mechanical act (“work hard to master the tools”) and a creative act (“the expressing of who you are”). If you lose someone through sloppy workmanship, then it’s your fault. If you lose someone because they don’t like what you have to say, don’t worry. “You are who you are, he is who he is, and either you’ll get along or you won’t.”

In other words, care about your audience’s ability to decipher your message, and get that part right, but don’t care about whether they’ll agree with you or like you. Just say what you need to say based on what you find in the data.

6. Words

Zinsser’s sixth chapter, entitled Words, deals with avoiding “cheap words, made-up words and cliché that have become so pervasive that a writer can hardly help using them.” His advice: “You must fight these phrases or you’ll sound like every hack.”

Do we have clichés in the world of data viz? Yes, we do. We all seek to imitate others in some way. The cliché in any field is just the tacky or ineffective element that people continue to use in spite of the fact that it’s bad, just because others use it. Think Periodic Tables and Subway Maps.

At the Tapestry Conference in 2014, Martin Wattenberg and Fernanda Viegas gave a presentation on genres in data visualization. They explained how we often use a shared language that gives our readers shortcuts to understanding. While this can be good, we are in danger of getting stuck in these genres, which can become formulaic.

Here’s their full presentation:

Wattenberg and Viegas say the key is awareness. Awareness of the elements of the genre or genres we’re in, and those elements to which we really shouldn’t adhere, because they don’t work or are tacky. We have to care enough to examine each element and root it out if it’s cliché, regardless of what our peers might be doing. According to Zinsser, “the only way to avoid it is to care deeply about words.” We also have to care about the data.

7. Usage

Zinsser’s last principle is about determining whether what’s new should be “ushered in” as accepted practice, or whether it should be “thrown out on [its] ear.” For any field to be vibrant and thriving, and for it to be at all fun, it must be fluid and not static. Is data viz fluid, or is it static?

Just as there is “no king to establish the King’s English,” there is no anointed panel to accept or reject new methods or tools in data viz, at least not that I know of. We all get to cast a vote by what we use. One of the chief values that innovators like Accurat and Periscopic bring to the field of data viz is a fresh take on this business of communicating with data. We all get to observe one another’s work, and if we keep the principles in mind, the history books will determine what gets kept and what gets left behind.

I’m confident that we’ll get the pixels right.

Thanks for reading,

Sunset Color Palettes

2015 March 20
by Ben Jones

Earlier this week I was visiting family in Ventura County, California. I had a nice view of the sunset one evening, and I noticed how much the color palette of the sky changed in seven minute increments. I used an iPhone to snap the pictures, Instant Eyedropper for Windows to pull out the hex codes of the uploaded images, and the website to create the color squares beneath. Enjoy!


Tapestry 2015: Seven Data Story Types

2015 March 4
by Ben Jones

Here is my 2015 Tapestry Conference presentation using Freedom of the Press data (Excel, source). You can follow the live feed at 10:30am ET on March 4th to hear the presentation.

Avoiding Data Pitfalls, Part 2: Fooled by Small Samples

2015 January 22
tags: ,
by Ben Jones

adpPrevious – “Part 1: Gaps Between Data and Reality

I’m reading “Thinking, Fast and Slow” by Nobel Prize winner Daniel Kahneman – a frighteningly interesting book about cognitive biases and “heursitics” (rules of thumb) in decision making. If you deal with numbers at all and haven’t read it yet, you should. In it he refers to an article by Howard Wainer and Harris L. Zwerling called “Evidence That Smaller Schools Do Not Improve Student Achievement” that talks about kidney cancer rates, of all things.

Kidney cancer is a relatively rare form of cancer, accounting for only ~3% of all adult cancers. If you look at kidney cancer rates by county in the U.S. an interesting pattern emerges, as he describes on page 109 of his book:

The counties in which the incidence of kidney cancer is lowest are mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West. p.109

What do you make of this? He goes on to list some of the reasons people come up with in an attempt to rationalize this fact: residents of rural counties have access to fresh food, lack of air pollution, etc. Did these explanations come to your mind, too? He then points out the following:

“Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.”

Again, people come up with various theories to explain this fact: rural counties have relatively high poverty rates, high fat diet, lack of access to medication, etc.

But wait – what’s going on here? Rural counties have both the highest and the lowest kidney cancer rates? What gives?

Insensitivity to Sample Size

This is a great example of a bias known as “insensitivity to sample size“. It goes like this: when we deal with data, we don’t take into account sample size when we think about probability. These rural counties have relatively few people, and as such, they are more likely to have either very high or very low incidence rates. Why? Because the variance of the mean is proportional to the sample size. The smaller the sample, the greater the variance (proof).

I found the 2007-2011 kidney cancer rate data and the 2010 population data for each U.S. county, and created this interactive graphic to illustrate the point that Kahneman, Wainer and Zwerlink are trying to make:

Notice a few things in the dashboard above:

  1. In the choropleth map, the darkest orange (high rates relative to the overall U.S. rate) and the darkest blue (low rates relative to the overall U.S. rate) counties are often right next to each other
  2. In the scatterplot below the map, the marks form a funnel shape, with less populous counties (to the left) more likely to deviate from the reference line (the overall US rate), and more populous counties like Chicago, L.A. and New York more likely to be close to the overall reference line
  3. If you hover over a county with a small population, you will notice that the average number of cases per year is extremely low – 4 cases or less sometimes. A small deviation – even just 1 or 2 cases – in a subsequent year will shoot a county from the bottom of the list to the top, or vice versa

Other Examples

Where else does “insensitivity to sample size” come up? My colleague Dash Davidson suggested “streaks” in sports, which can often be just a “clustering illusion“. We look at a brief sample of a player’s overall performance and notice temporary periods of greatness. We should expect to see such streaks for even mediocre players. Remember Linsanity? Similarly, small samples make some rich and others poor in the world of gambling. You may have a good day at the tables, but if you keep playing, eventually the house will win. And in investing, “diversification” is nothing more than a strategy to minimize exposure to extreme downside risks of individual securities (think Enron).

Kahneman and his long-time partner Amos Tversky even showed that 84 professional psychologists were subject to this very same bias, so experts are not immune.

Avoiding this Pitfall

So what do we do about it? How do we make sure we don’t fall into the pitfall known as “insensitivity to sample size”?

  • Be aware of any sampling involved in the data we are analyzing
  • Understand that the smaller the sample size, the more likely we will see a rate or statistic that deviates significantly from the population
  • Before forming theories about why a particular sample deviates from the population in some way, first consider that it may just be noise and chance
  • Visualize the rate or statistic associated with groups of varying size in a scatterplot. If you see the telltale funnel shape, then you know not to be fooled

In Conclusion

lostsprings_0The point of the original article by Wainer and Zwerling is that smaller schools are apt to yield extreme test scores by virtue of the fact that there aren’t enough students in small schools to “even out” the scores. A random cluster of extremely good (or bad) performers can sway a small school’s scores. At a very big school, yes a few bad results will still affect the overall mean, but not nearly as much.

Here’s another way to think of it: if Daniel Kahneman ever moved to Lost Springs, Wyoming, then half of the town’s population would be Nobel Prize winners. And if you think that moving there would increase your chances of winning the Nobel Prize, or that it’s “in the water” or some other such reason, then you’re suffering from a severe case of insensitivity to sample size.

Do any other examples of this pitfall come to mind? Ever fall into it yourself? Share by leaving a comment below.

Thanks for stopping by,

Avoiding Data Pitfalls, Part 1: Gaps Between Data and Reality

2015 January 6
by Ben Jones

adpHappy New Year! In 2015 I’ll be publishing a periodic series of blog posts entitled “Avoiding Data Pitfalls” where I’ll suggest ways to avoid common errors people make when working with data. The pitfalls range from philosophical to technical, and from analytical to visual. I’m familiar with these pitfalls because I’ve fallen into them myself, some of them repeatedly. If I’m the only one that these posts keep out of trouble, then it’ll be worth it.

We fall head first into a pitfall when we fail to remember that a gap exists between our data and reality. Do people really fail to realize this? I see (and make) this mistake quite often. I’m starting with this one because it’s foundational, dealing with the grounds and limits of our knowledge. How does it work?

It works like this: we get some data, and run with it, never stopping to think about where it came from, who collected it, what it tells us, and, importantly, what it doesn’t tell us.

It’s easy when working with data to treat it as reality rather than data collected about reality. Here are some examples:

  • It’s not crime, it’s reported crime.
  • It’s not the number of meteor strikes, it’s the number of recorded meteor strikes.
  • It’s not the outer diameter of a mechanical part, it’s the measured outer diameter.
  • It’s not how the public feels about a controversial topic, it’s how survey respondents are willing to say they feel.
  • It’s not how many people suffer from a particular disease, it’s how many times a doctor diagnosed people with a particular disease.

You get the picture. This distinction may seem like a technicality, and sometimes it is (the number of home runs Hank Aaron “reportedly” hit?) but it can also be a big deal. Let’s see an example of how missing it can lead us astray:

Example #1: Actual vs. Recorded Earthquakes

Consider earthquakes. The USGS provides a Earthquake Archive Search form that lets visitors obtain a list of historical earthquakes that meet various criteria. A query of earthquakes of magnitude 6.0 and above from 1900 to 2013 yields the following, somewhat alarming, line plot:

Are we really to believe that earthquakes have increased in frequency by this much? Obviously not. The world that measured and collected earthquakes in the early 20th century was very different than the one that did so in the last decade. Comparisons across decades, and even within some decades (the 1960s), aren’t “apples-to-apples” due to the changes in technology.

If we separate the line plot by magnitude and add annotations that describe advances in seismology, we see that the rise is only in the smaller group (magnitude 6.0-6.9), and coincides with dramatic improvements in instrumentation:

It’s safe to say that the rise in recorded earthquakes is primarily due to the improvements in our ability to detect them. There may also be an upward trend in actual earthquakes over this time, but our data doesn’t reveal it to us due to the continual changes in the quality of the measurement system. When it comes to earthquakes, the gap between data and reality is getting smaller. The problem is that the “data-reality gap” is changing over the time period we’re considering. And it’s hard to know for sure exactly how many magnitude 6.0 earthquakes we missed in any particular year.

Let’s look at another example – counting bicycles that cross a bridge.

Example #2: Counting Bicycles

Everyday on my way to work I walk across the Fremont Bridge. It’s a bright blue and orange double-leaf bascule bridge that was built in 1917. Since it sits so close to the water, it opens on average 35 times a day, which supposedly makes it the most opened drawbridge in the United States. Here’s what it looks like:


The City of Seattle Department of Transportation has installed two “inductive loops” on the pedestrian/bicycle pathways of the bridge that are designed to count the number of bicycles that cross the bridge in either direction, all day every day. The city also provides hourly counts going back to October 2, 2012 at Downloading this data and visualizing it yields the following timeline:

I showed this data at a recent luncheon of the Puget Sound Research Forum, and asked what the attendees thought of these spikes. I honestly didn’t know what had caused them. A few ideas sprang from the crowd – was it “bike to work day”, really good weather, or maybe there was some big bike race or club event? Notice how each of these ideas is based on the assumption that there actually were more bikes that crossed the bridge on those days.

David Bauer was in the audience and found the answer for us: equipment error. The counters just glitched for a few hours on both days. You can read all the details of these anomalous readings and the correspondence between a local blogger and a city employee at the Seattle Bike Blog. I won’t repeat it here, but suffice it to say measuring things can be tricky. Turns out a low battery was the culprit.

Let’s consider one last example: counting Ebola deaths.

Example #3: Ebola Deaths

This past year, the whole world watched in horror as Ebola ravaged West Africa. It’s still happening, by the way, we’ve just stopped noticing. In any case, the WHO provides data about fatalities in weekly situation reports. I had an interesting discussion on twitter with Alex McDonnell about this data. In it he referred to errors in the WHO reports.

Errors? About one of the world’s most closely followed topics? From one of the world’s most respected organizations? You bet.

Let’s take a look at a timeline of cumulative deaths from Ebola as reported by WHO and CDC. Notice the drops in cumulative death counts – the handful of times when the lines slope down:

Of course it makes perfect sense: the task of diagnosing disease and ascertaining causes of death in some of the more remote locations, where the equipment and staff are often severely limited, must be incredibly difficult. If you read the WHO situation report you’ll notice that they classify cases as “confirmed”, “probable” and “suspected”. It’s not always so obvious. Here are the criteria:


The WHO and CDC actually do a very good job of speaking clearly about “reported” cases (the Dec 31st WHO situation report includes the word “reported” no less than 61 times).

I don’t bring up this example to criticize the people or organizations involved with fighting and documenting the Ebola outbreak. Far from it. This example merely demonstrates that the gap between data and reality can exist even when the stakes are high. Classifying diseases and deaths in chaotic conditions can be tricky business indeed.

How to Avoid Confusing Data with Reality

Notice that in these three examples – 1. earthquakes (a dubious trend), 2. bicycle counting (a spike or outlier), and 3. Ebola deaths (a downward slope in a cumulative line plot) – something in the view of the data itself alerted us to a potential “data-reality gap”. Visualizing the data can be one of the best ways to find problems with it.

Earlier in the game, though, it helps to remind ourselves that every data point that exists was collected, stored, accessed, etc, via imperfect processes. The more we know about these process – the equipment used, the protocol followed, the people involved, the steps they took, their motivations – the better equipped we will be to assess the “data-reality gap”.

Here are six seven suggestions to help you avoid confusing data with reality:

  1. Clearly understand the operational definitions of all metrics
  2. Draw the data collection steps as a process flow diagram
  3. Understand the limitations and inaccuracies of each step in the process
  4. Identify any changes in method or equipment over time
  5. Seek to understand the motives of the people collecting and reporting. Could there be any biases or incentives involved?
  6. Visualize the data and investigate any shifts, outliers and trends for possible discrepancies
  7. Think carefully about data formatting, processing, and transformations (thanks Keith!)

Ultimately, each data collection activity is unique, and there are too many possible sources of error to list them all. These are some typical ones that I’ve come in contact with, and you may have your own suggestions. I’d love to hear them.

In Conclusion

At the core of this first “data pitfall” is our attitude toward data. Do we arrogantly or naively see ourselves as experts on a topic as soon as we get our hands on some data, or do we humbly realize that our knowledge is imperfect, and we may not know the full story?

We can’t ever perfectly know the “data-reality gap” because that would require perfect data. What we can do, though, is seek to identify any gaps that may exist, and take that into account when we use data to form our opinions.

Thanks for reading,

Next: Part 2: Fooled by Small Samples

Kobe Vs. Jordan: what defines a “career”?

2014 December 16
by Ben Jones

Since Kobe Bryant surpassed Michael Jordan in “career points” scored this past Sunday, much has been written about who is better, etc. If a) you’ve heard enough already, or b) you don’t care about sports at all, then you may stop reading now. There is a finer point about “data definitions”, but I’ll leave that for later. First, back to basketball player egos:

There is one thing about this debate that puzzles me: why regular season stats alone are used in the “career” totals. I can understand leaving out the All-Star game stats, as the All-Star game is largely a meaningless pick-up game (and crazy points-fest). But playoffs? NBA basketball is all about the playoffs. That’s where the games really matter, the real drama unfolds, and Hall-of-Fame “careers” are made.

Bryant and Jordan have both had considerable success in the playoffs, but somehow the points they scored during these crucial games don’t count to their career totals? I don’t get it. If there’s a compelling reason to leave out playoff stats from “career totals”, I’m unaware of it.

And if you take into consideration playoff point totals when tallying career stats, Bryant hasn’t passed Jordan quite yet:

Of course you can make the argument that this just postpones the inevitable by a few weeks, as Bryant will soon surpass Jordan in total points scored including the playoffs. So who cares?

To me, it just shows how we can get wrapped up in debates about numbers without stopping to consider the “data definitions” – exactly what are we comparing? How is this data collected, what is included and what is not included? Does it even make sense?

Did you know all those dramatic game-winning shots these two players made in crunch time aren’t even included in their career totals?

Like this one:

And these ones:

I bet you didn’t. If you did, can you defend it? I can’t.

At some point in the next month or so, Kobe will pass Jordan in the amount of total points he has scored in both the regular season and the playoffs combined. There will be no hugging at center court, no interviews, no headlines or blog posts about the stats. That’s not such a bad thing, though, I guess.

As for who is the better scorer? Jordan scored at a higher rate, but took two extended breaks during his career. Bryant took no such breaks, and only had one considerable setback to injury, so he racked up points at a slower, albeit unabated pace. And for all the flak Bryant has taken for being a ball hog, try comparing his career Assists with Jordan’s. He passed Jordan in assists two years ago.

Details, details…

Thanks for indulging me,