William Zinsser, 92, died last week. Zinsser, author of On Writing Well, the classic guide to writing nonfiction, has been an inspiration to writers and aspiring writers since he first published his manual in 1976. Douglas Martin of the New York Times has written an excellent obituary on Zinsser.
1.5 million copies of On Writing Well have been sold to people all around the world who care about “getting the words right”, as Ernest Hemingway put it. I first read On Writing Well before launching this website four years ago, and I referred to it countless times while writing my first book, Communicating Data With Tableau (O’Reilly, 2014). Zinsser’s book taught me to respect sentences and words more than I had before. Or, to put it another way, it made me realize that my writing sucked. It was a harsh realization at the time, but I needed to know that upfront. I’d like to think that my writing sucks a little less thanks to Zinsser.
As I read On Writing Well it struck me that his advice for communicating well with words applies directly to the craft of communicating visually with data. His seven principles in Part I – The Transaction, Simplicity, Clutter, Style, The Audience, Words, and Usage – could be written about visualizing data as well.
Let’s call it On Visualizing Data Well:
1. The Transaction
Zinsser opened his classic book by teaching that “the product that any writer has to sell is not the subject being written about, but who he or she is.” (emphasis mine). The transaction, then, is a personal one – the reader is drawn in by the “enthusiasm of the writer for his field,” and the two most important qualities that result are “humanity and warmth.”
The same is true when someone examines the product created by a data visualizer. The humanity of the visualizer should shine through:
- What does the person creating the visualization think about the topic?
- Why does he or she care about this topic?
- How does he or she feel about it?
- Do you know something more about them, not just their topic?
Take another look at one of the most talked about visualizations of 2014, Periscopic’s animated U.S. Gun Deaths. Regardless of what you think about the design, form or aesthetic of the final product, you can’t help but feel the emotions of those who created it. Is there any doubt about what they’re trying to say, and more importantly, why they’re saying it? Kim Rees and her team brought their own humanity to the transaction:
2. Simplicity
Zinsser taught that “the secret of good writing is to strip every sentence to its cleanest components,” and that there are a “thousand and one adulterants that weaken the strength of a sentence.” Here’s an example he gives of “the clotted language of everyday American commerce:”
“The airline pilot who announces that he is presently anticipating experiencing considerable precipitation wouldn’t think of saying it may rain.”
It’s easy to laugh at this bloated phrase because we see it all the time, and we even fall prey to it ourselves. The same is true when we communicate with data. The lesson is that we shouldn’t overcomplicate the message.
This lesson is often misunderstood to mean that we should dumb down the message, or only choose simplistic messages in the first place. This interpretation is wrong. Just as a writer sometimes seeks to articulate a profound thought, we sometimes seek to show relationships that are complex. That’s okay, and we shouldn’t shrink from that challenge in the name of simplicity. But if there’s a clear way to show it, then we should show it clearly. In the words of Albert Einstein, “everything should be made as simple as possible, but not simpler.”
3. Clutter
Zinsser wrote “writing improves in direct ratio to the number of things we keep out of it that shouldn’t be there.” He opens his third chapter with a funny example from the annals of U.S. history:
“Consider what President Nixon’s aide John Dean accomplished in just one day of testimony on television during the Watergate hearings. The next day everyone in America was saying ‘at this point in time’ instead of ‘now’.”
His admonition is to “examine every word you put on paper.” When working with his students at Yale, Zinsser would “put brackets around every component in a piece of writing that wasn’t doing useful work.” Sound familiar? Edward Tufte’s notion of chartjunk is the same notion. Designers and artists celebrate the white space in their creations. Obviously we shouldn’t remove every pixel, just the ones that aren’t doing any work. The trick is knowing which is which.
4. Style
Next Zinsser addresses the objection that reducing a writing product to its simplest form leaves no room for style. He concedes that “simplicity carried to an extreme might seem to point to a style little more sophisticated than ‘Dick likes Jane’ and ‘See Spot run’.”
In data viz, the corollary to these preschool sentences is the bar chart. Simple, easy to understand, but no flair. We’re in familiar territory. It’s the never-ending “clarity vs. beauty” debate. But it’s a false dichotomy. Clarity and beauty are not mutually exclusive. Of course we can achieve both. Information visualization design firm Accurat does it all the time. Here’s an example of their work:
How can we achieve both simplicity and style? Clarity and beauty? Zinsser’s advice for writers applies to us, too. There’s a reason his chapter on style follows the previous two. A singer with loads of personality who sings out of tune won’t sell records. A carpenter who adds bevels and carvings galore to a chair that doesn’t hold your weight won’t stay in business for long. “This is the problem of writers who set out deliberately to garnish their prose.” Zinsser uses the wood-working analogy to show us the way:
“Extending the metaphor of carpentry, it’s first necessary to be able to saw wood neatly and to drive nails. Later you can bevel the edges or add elegant finials, if that’s your taste. But you can never forget that you are practicing a craft that’s based on certain principles.”
To create a data visualization that is both clear and beautiful, we first must get the raw materials and basic proportions right. Only then can we add what Willard Cope Brinton calls “judicious embellishment of charts”. What’s judicious? Fortunately, as Zinsser puts it, “there is no style store”, and you’ll have to answer that for yourself. Your audience will also have an opinion on the matter.
5. The Audience
Speaking of audience, Zinsser addresses this critical element in the fifth chapter of his masterpiece. We often talk about “knowing your audience” in data viz, and user-centered design in product development. It’s a very popular topic. I even give similar platitudes in the first chapter of my own book.
But Zinsser gives what at first seems like shocking advice on this subject. He says:
“You are writing for yourself. Don’t try to visualize the great mass audience. There is no such audience – every reader is a different person.”
Only write for yourself and don’t even consider who will see your work? Really? He clarifies by differentiating between a mechanical act (“work hard to master the tools”) and a creative act (“the expressing of who you are”). If you lose someone through sloppy workmanship, then it’s your fault. If you lose someone because they don’t like what you have to say, don’t worry. “You are who you are, he is who he is, and either you’ll get along or you won’t.”
In other words, care about your audience’s ability to decipher your message, and get that part right, but don’t care about whether they’ll agree with you or like you. Just say what you need to say based on what you find in the data.
6. Words
Zinsser’s sixth chapter, entitled Words, deals with avoiding “cheap words, made-up words and cliché that have become so pervasive that a writer can hardly help using them.” His advice: “You must fight these phrases or you’ll sound like every hack.”
Do we have clichés in the world of data viz? Yes, we do. We all seek to imitate others in some way. The cliché in any field is just the tacky or ineffective element that people continue to use in spite of the fact that it’s bad, just because others use it. Think Periodic Tables and Subway Maps.
At the Tapestry Conference in 2014, Martin Wattenberg and Fernanda Viegas gave a presentation on genres in data visualization. They explained how we often use a shared language that gives our readers shortcuts to understanding. While this can be good, we are in danger of getting stuck in these genres, which can become formulaic.
Here’s their full presentation:
Wattenberg and Viegas say the key is awareness. Awareness of the elements of the genre or genres we’re in, and those elements to which we really shouldn’t adhere, because they don’t work or are tacky. We have to care enough to examine each element and root it out if it’s cliché, regardless of what our peers might be doing. According to Zinsser, “the only way to avoid it is to care deeply about words.” We also have to care about the data.
7. Usage
Zinsser’s last principle is about determining whether what’s new should be “ushered in” as accepted practice, or whether it should be “thrown out on [its] ear.” For any field to be vibrant and thriving, and for it to be at all fun, it must be fluid and not static. Is data viz fluid, or is it static?
Just as there is “no king to establish the King’s English,” there is no anointed panel to accept or reject new methods or tools in data viz, at least not that I know of. We all get to cast a vote by what we use. One of the chief values that innovators like Accurat and Periscopic bring to the field of data viz is a fresh take on this business of communicating with data. We all get to observe one another’s work, and if we keep the principles in mind, the history books will determine what gets kept and what gets left behind.
I’m confident that we’ll get the pixels right.
Thanks for reading,
Earlier this week I was visiting family in Ventura County, California. I had a nice view of the sunset one evening, and I noticed how much the color palette of the sky changed in seven minute increments. I used an iPhone to snap the pictures, Instant Eyedropper for Windows to pull out the hex codes of the uploaded images, and the website Color-hex.com to create the color squares beneath. Enjoy!
Previous – “Part 1: Gaps Between Data and Reality”
I’m reading “Thinking, Fast and Slow” by Nobel Prize winner Daniel Kahneman – a frighteningly interesting book about cognitive biases and “heuristics” (rules of thumb) in decision making. If you deal with numbers at all and haven’t read it yet, you should. In it he refers to an article by Howard Wainer and Harris L. Zwerling called “Evidence That Smaller Schools Do Not Improve Student Achievement” that talks about kidney cancer rates, of all things.
Kidney cancer is a relatively rare form of cancer, accounting for only ~3% of all adult cancers. If you look at kidney cancer rates by county in the U.S. an interesting pattern emerges, as he describes on page 109 of his book:
“The counties in which the incidence of kidney cancer is lowest are mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.” (p. 109)
What do you make of this? He goes on to list some of the reasons people come up with in an attempt to rationalize this fact: residents of rural counties have access to fresh food, lack of air pollution, etc. Did these explanations come to your mind, too? He then points out the following:
“Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.”
Again, people come up with various theories to explain this fact: rural counties have relatively high poverty rates, high fat diet, lack of access to medication, etc.
But wait – what’s going on here? Rural counties have both the highest and the lowest kidney cancer rates? What gives?
Insensitivity to Sample Size
This is a great example of a bias known as “insensitivity to sample size”. It goes like this: when we deal with data, we don’t take into account sample size when we think about probability. These rural counties have relatively few people, and as such, they are more likely to have either very high or very low incidence rates. Why? Because the variance of the mean is inversely proportional to the sample size. The smaller the sample, the greater the variance (proof).
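For anyone who wants the step behind that last claim, here is a short sketch. It assumes the n observations are independent and share a common variance σ², which is an idealization that real county case counts only approximate:

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n}
```

Under those assumptions, the standard deviation of an observed rate shrinks like 1/√n, so a county with 100 times fewer people should show roughly 10 times more scatter in its rate.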
I found the 2007-2011 kidney cancer rate data and the 2010 population data for each U.S. county, and created this interactive graphic to illustrate the point that Kahneman, Wainer and Zwerling are trying to make:
Notice a few things in the dashboard above:
- In the choropleth map, the darkest orange (high rates relative to the overall U.S. rate) and the darkest blue (low rates relative to the overall U.S. rate) counties are often right next to each other
- In the scatterplot below the map, the marks form a funnel shape, with less populous counties (to the left) more likely to deviate from the reference line (the overall US rate), and more populous counties like Chicago, L.A. and New York more likely to be close to the overall reference line
- If you hover over a county with a small population, you will notice that the average number of cases per year is extremely low – 4 cases or fewer sometimes. A small deviation – even just 1 or 2 cases – in a subsequent year will shoot a county from the bottom of the list to the top, or vice versa
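The funnel pattern described above can be reproduced with a small simulation. This is only a sketch: the county sizes, the number of counties, and the rate of 0.01 are made-up values chosen so that it runs quickly, not the real kidney cancer figures. Every simulated county shares exactly the same underlying rate, so any differences between them are pure chance.

```python
import random

def simulate_rates(populations, true_rate, seed=42):
    """Draw an observed rate for each county, all sharing the same true rate."""
    rng = random.Random(seed)
    rates = []
    for population in populations:
        # Each resident independently becomes a case with probability true_rate.
        cases = sum(1 for _ in range(population) if rng.random() < true_rate)
        rates.append((population, cases / population))
    return rates

# Hypothetical setup: 200 small counties and 20 large ones, identical true rate.
SMALL, LARGE, TRUE_RATE = 500, 50_000, 0.01
results = simulate_rates([SMALL] * 200 + [LARGE] * 20, TRUE_RATE)

lowest = min(results, key=lambda r: r[1])
highest = max(results, key=lambda r: r[1])
small_rates = [rate for pop, rate in results if pop == SMALL]
large_rates = [rate for pop, rate in results if pop == LARGE]

print("lowest observed rate (population, rate):", lowest)
print("highest observed rate (population, rate):", highest)
print("spread among small counties:", max(small_rates) - min(small_rates))
print("spread among large counties:", max(large_rates) - min(large_rates))
```

Both the highest and the lowest observed rates land in small counties, even though no county is actually any healthier or sicker than another.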
Where else does “insensitivity to sample size” come up? My colleague Dash Davidson suggested “streaks” in sports, which can often be just a “clustering illusion”. We look at a brief sample of a player’s overall performance and notice temporary periods of greatness. We should expect to see such streaks for even mediocre players. Remember Linsanity? Similarly, small samples make some rich and others poor in the world of gambling. You may have a good day at the tables, but if you keep playing, eventually the house will win. And in investing, “diversification” is nothing more than a strategy to minimize exposure to extreme downside risks of individual securities (think Enron).
Kahneman and his long-time partner Amos Tversky even showed that 84 professional psychologists were subject to this very same bias, so experts are not immune.
Avoiding this Pitfall
So what do we do about it? How do we make sure we don’t fall into the pitfall known as “insensitivity to sample size”?
- Be aware of any sampling involved in the data we are analyzing
- Understand that the smaller the sample size, the more likely we will see a rate or statistic that deviates significantly from the population
- Before forming theories about why a particular sample deviates from the population in some way, first consider that it may just be noise and chance
- Visualize the rate or statistic associated with groups of varying size in a scatterplot. If you see the telltale funnel shape, then you know not to be fooled
The point of the original article by Wainer and Zwerling is that smaller schools are apt to yield extreme test scores by virtue of the fact that there aren’t enough students in small schools to “even out” the scores. A random cluster of extremely good (or bad) performers can sway a small school’s scores. At a very big school, yes a few bad results will still affect the overall mean, but not nearly as much.
Here’s another way to think of it: if Daniel Kahneman ever moved to Lost Springs, Wyoming, then half of the town’s population would be Nobel Prize winners. And if you think that moving there would increase your chances of winning the Nobel Prize, or that it’s “in the water” or some other such reason, then you’re suffering from a severe case of insensitivity to sample size.
Do any other examples of this pitfall come to mind? Ever fall into it yourself? Share by leaving a comment below.
Thanks for stopping by,
Happy New Year! In 2015 I’ll be publishing a periodic series of blog posts entitled “Avoiding Data Pitfalls” where I’ll suggest ways to avoid common errors people make when working with data. The pitfalls range from philosophical to technical, and from analytical to visual. I’m familiar with these pitfalls because I’ve fallen into them myself, some of them repeatedly. If I’m the only one that these posts keep out of trouble, then it’ll be worth it.
We fall head first into a pitfall when we fail to remember that a gap exists between our data and reality. Do people really fail to realize this? I see (and make) this mistake quite often. I’m starting with this one because it’s foundational, dealing with the grounds and limits of our knowledge. How does it work?
It works like this: we get some data, and run with it, never stopping to think about where it came from, who collected it, what it tells us, and, importantly, what it doesn’t tell us.
It’s easy when working with data to treat it as reality rather than data collected about reality. Here are some examples:
- It’s not crime, it’s reported crime.
- It’s not the number of meteor strikes, it’s the number of recorded meteor strikes.
- It’s not the outer diameter of a mechanical part, it’s the measured outer diameter.
- It’s not how the public feels about a controversial topic, it’s how survey respondents are willing to say they feel.
- It’s not how many people suffer from a particular disease, it’s how many times a doctor diagnosed people with a particular disease.
You get the picture. This distinction may seem like a technicality, and sometimes it is (the number of home runs Hank Aaron “reportedly” hit?) but it can also be a big deal. Let’s see an example of how missing it can lead us astray:
Example #1: Actual vs. Recorded Earthquakes
Consider earthquakes. The USGS provides an Earthquake Archive Search form that lets visitors obtain a list of historical earthquakes that meet various criteria. A query of earthquakes of magnitude 6.0 and above from 1900 to 2013 yields the following, somewhat alarming, line plot:
Are we really to believe that earthquakes have increased in frequency by this much? Obviously not. The world that measured and collected earthquakes in the early 20th century was very different from the one that did so in the last decade. Comparisons across decades, and even within some decades (the 1960s), aren’t “apples-to-apples” due to the changes in technology.
If we separate the line plot by magnitude and add annotations that describe advances in seismology, we see that the rise is only in the smaller group (magnitude 6.0-6.9), and coincides with dramatic improvements in instrumentation:
It’s safe to say that the rise in recorded earthquakes is primarily due to the improvements in our ability to detect them. There may also be an upward trend in actual earthquakes over this time, but our data doesn’t reveal it to us due to the continual changes in the quality of the measurement system. When it comes to earthquakes, the gap between data and reality is getting smaller. The problem is that the “data-reality gap” is changing over the time period we’re considering. And it’s hard to know for sure exactly how many magnitude 6.0 earthquakes we missed in any particular year.
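Splitting the counts by magnitude band, as in the plot above, takes only a few lines. Here is a minimal sketch, assuming the query results have been reduced to (year, magnitude) pairs. The function names and the sample records are hypothetical, and the sample values are made up for illustration, not taken from the USGS archive.

```python
from collections import Counter

def magnitude_band(magnitude):
    """Label a quake by whole-number band, e.g. 6.4 falls in '6.0-6.9'."""
    low = int(magnitude)
    return f"{low}.0-{low}.9"

def count_by_decade_and_band(quakes):
    """Count quakes per (decade, magnitude band) from (year, magnitude) pairs."""
    counts = Counter()
    for year, magnitude in quakes:
        decade = (year // 10) * 10
        counts[(decade, magnitude_band(magnitude))] += 1
    return counts

# Made-up records for illustration only; NOT real USGS data.
sample = [(1903, 7.1), (1968, 6.2), (1968, 6.4), (2005, 6.1), (2005, 6.8), (2005, 7.3)]
counts = count_by_decade_and_band(sample)

for (decade, band), n in sorted(counts.items()):
    print(decade, band, n)
```

If the larger bands stay roughly flat while only the smallest band climbs, that points toward better detection rather than more earthquakes.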
Let’s look at another example – counting bicycles that cross a bridge.
Example #2: Counting Bicycles
Every day on my way to work I walk across the Fremont Bridge. It’s a bright blue and orange double-leaf bascule bridge that was built in 1917. Since it sits so close to the water, it opens on average 35 times a day, which supposedly makes it the most opened drawbridge in the United States. Here’s what it looks like:
The City of Seattle Department of Transportation has installed two “inductive loops” on the pedestrian/bicycle pathways of the bridge that are designed to count the number of bicycles that cross the bridge in either direction, all day every day. The city also provides hourly counts going back to October 2, 2012 at data.seattle.gov. Downloading this data and visualizing it yields the following timeline:
I showed this data at a recent luncheon of the Puget Sound Research Forum, and asked what the attendees thought of these spikes. I honestly didn’t know what had caused them. A few ideas sprang from the crowd – was it “bike to work day”, really good weather, or maybe there was some big bike race or club event? Notice how each of these ideas is based on the assumption that there actually were more bikes that crossed the bridge on those days.
David Bauer was in the audience and found the answer for us: equipment error. The counters just glitched for a few hours on both days. You can read all the details of these anomalous readings and the correspondence between a local blogger and a city employee at the Seattle Bike Blog. I won’t repeat it here, but suffice it to say measuring things can be tricky. Turns out a low battery was the culprit.
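One way to spot readings like these before theorizing about them is to flag counts that sit far from the typical value. The sketch below uses the median and the median absolute deviation (MAD). The function name, the threshold of 10, and the hourly counts are all my own assumptions for illustration; they are not the city's method or its data.

```python
import statistics

def flag_suspect_counts(hourly_counts, threshold=10.0):
    """Return indexes of counts that sit far from the median, measured in MADs."""
    median = statistics.median(hourly_counts)
    deviations = [abs(count - median) for count in hourly_counts]
    mad = statistics.median(deviations)
    if mad == 0:
        # No spread at all in the typical readings, so nothing to scale against.
        return []
    return [i for i, dev in enumerate(deviations) if dev / mad > threshold]

# Made-up hourly counts for illustration; the 900 plays the role of a glitch.
counts = [40, 55, 62, 48, 51, 900, 58, 45]
suspects = flag_suspect_counts(counts)
print("hours worth a closer look:", suspects)
```

A flag is only a prompt to go ask questions. It can't tell a dying battery from a genuine bike-to-work day.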
Let’s consider one last example: counting Ebola deaths.
Example #3: Ebola Deaths
This past year, the whole world watched in horror as Ebola ravaged West Africa. It’s still happening, by the way; we’ve just stopped noticing. In any case, the WHO provides data about fatalities in weekly situation reports. I had an interesting discussion on Twitter with Alex McDonnell about this data. In it he referred to errors in the WHO reports.
Errors? About one of the world’s most closely followed topics? From one of the world’s most respected organizations? You bet.
Let’s take a look at a timeline of cumulative deaths from Ebola as reported by WHO and CDC. Notice the drops in cumulative death counts – the handful of times when the lines slope down:
Of course it makes perfect sense: the task of diagnosing disease and ascertaining causes of death in some of the more remote locations, where the equipment and staff are often severely limited, must be incredibly difficult. If you read the WHO situation report you’ll notice that they classify cases as “confirmed”, “probable” and “suspected”. It’s not always so obvious. Here are the criteria:
The WHO and CDC actually do a very good job of speaking clearly about “reported” cases (the Dec 31st WHO situation report includes the word “reported” no less than 61 times).
I don’t bring up this example to criticize the people or organizations involved with fighting and documenting the Ebola outbreak. Far from it. This example merely demonstrates that the gap between data and reality can exist even when the stakes are high. Classifying diseases and deaths in chaotic conditions can be tricky business indeed.
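A cumulative total should never go down, so the downward slopes mentioned earlier are easy to check for in code as well as by eye. Here is a minimal sketch, assuming the reports have been reduced to (label, cumulative total) pairs in date order. The function name and the numbers are hypothetical, not actual WHO or CDC figures.

```python
def find_cumulative_drops(series):
    """Return (label, previous, current) wherever a cumulative total decreases."""
    drops = []
    for (_, prev_value), (label, value) in zip(series, series[1:]):
        if value < prev_value:
            drops.append((label, prev_value, value))
    return drops

# Made-up report totals for illustration; NOT actual WHO or CDC figures.
reports = [("week 1", 100), ("week 2", 140), ("week 3", 132), ("week 4", 170)]
drops = find_cumulative_drops(reports)

for label, previous, current in drops:
    print(f"{label}: cumulative total fell from {previous} to {current}")
```

Each drop marks a revision or reclassification worth reading about in the report itself, not a mistake to paper over.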
How to Avoid Confusing Data with Reality
Notice that in these three examples – 1. earthquakes (a dubious trend), 2. bicycle counting (a spike or outlier), and 3. Ebola deaths (a downward slope in a cumulative line plot) – something in the view of the data itself alerted us to a potential “data-reality gap”. Visualizing the data can be one of the best ways to find problems with it.
Earlier in the game, though, it helps to remind ourselves that every data point that exists was collected, stored, accessed, etc., via imperfect processes. The more we know about these processes – the equipment used, the protocol followed, the people involved, the steps they took, their motivations – the better equipped we will be to assess the “data-reality gap”.
Here are seven suggestions to help you avoid confusing data with reality:
- Clearly understand the operational definitions of all metrics
- Draw the data collection steps as a process flow diagram
- Understand the limitations and inaccuracies of each step in the process
- Identify any changes in method or equipment over time
- Seek to understand the motives of the people collecting and reporting. Could there be any biases or incentives involved?
- Visualize the data and investigate any shifts, outliers and trends for possible discrepancies
- Think carefully about data formatting, processing, and transformations (thanks Keith!)
Ultimately, each data collection activity is unique, and there are too many possible sources of error to list them all. These are some typical ones that I’ve come in contact with, and you may have your own suggestions. I’d love to hear them.
At the core of this first “data pitfall” is our attitude toward data. Do we arrogantly or naively see ourselves as experts on a topic as soon as we get our hands on some data, or do we humbly realize that our knowledge is imperfect, and we may not know the full story?
We can’t ever perfectly know the “data-reality gap” because that would require perfect data. What we can do, though, is seek to identify any gaps that may exist, and take that into account when we use data to form our opinions.
Thanks for reading,
Since Kobe Bryant surpassed Michael Jordan in “career points” scored this past Sunday, much has been written about who is better, etc. If a) you’ve heard enough already, or b) you don’t care about sports at all, then you may stop reading now. There is a finer point about “data definitions”, but I’ll leave that for later. First, back to basketball player egos:
There is one thing about this debate that puzzles me: why regular season stats alone are used in the “career” totals. I can understand leaving out the All-Star game stats, as the All-Star game is largely a meaningless pick-up game (and crazy points-fest). But playoffs? NBA basketball is all about the playoffs. That’s where the games really matter, the real drama unfolds, and Hall-of-Fame “careers” are made.
Bryant and Jordan have both had considerable success in the playoffs, but somehow the points they scored during these crucial games don’t count to their career totals? I don’t get it. If there’s a compelling reason to leave out playoff stats from “career totals”, I’m unaware of it.
And if you take into consideration playoff point totals when tallying career stats, Bryant hasn’t passed Jordan quite yet:
Of course you can make the argument that this just postpones the inevitable by a few weeks, as Bryant will soon surpass Jordan in total points scored including the playoffs. So who cares?
To me, it just shows how we can get wrapped up in debates about numbers without stopping to consider the “data definitions” – exactly what are we comparing? How is this data collected, what is included and what is not included? Does it even make sense?
Did you know all those dramatic game-winning shots these two players made in crunch time aren’t even included in their career totals?
Like this one:
And these ones:
I bet you didn’t. If you did, can you defend it? I can’t.
At some point in the next month or so, Kobe will pass Jordan in the total number of points he has scored in the regular season and the playoffs combined. There will be no hugging at center court, no interviews, no headlines or blog posts about the stats. That’s not such a bad thing, though, I guess.
As for who is the better scorer? Jordan scored at a higher rate, but took two extended breaks during his career. Bryant took no such breaks, and only had one considerable setback to injury, so he racked up points at a slower, albeit unabated pace. And for all the flak Bryant has taken for being a ball hog, try comparing his career Assists with Jordan’s. He passed Jordan in assists two years ago.
Thanks for indulging me,
In June of this year I published my first book with O’Reilly Media called “Communicating Data with Tableau”. It has been great to hear from readers around the world, and I’m grateful for the reviews that have been published.
Here is a sampler that includes the entirety of Chapter 1 (entitled “Communicating Data”) in pdf format, freely available for anyone to download and share (click to download the pdf):
In Chapter 1, I share my thoughts on the concept of creating and sharing data visualizations as a particular form of the communication process. Thinking of it that way has been very helpful for me, because it frames the activity as an attempt to affect the minds of others.
Just like other types of communication such as speech and even body language, its success depends on many factors, some obvious and some subtle. It is subject to problems of various kinds, as identified by the fathers of information theory, Shannon and Weaver. And critically, communicating data touches on both the rational and the emotional.
Ultimately, I outline 6 Principles of Communicating Data, for which I have also created a handy online checklist. These principles have helped me focus my efforts and avoid some common pitfalls that I have fallen into in the past, such as failing to identify and know my audience, or only considering a subset of the relevant data.
Thanks for visiting, and please be sure to send me a note to tell me what you think about these resources, and what you would add/edit/remove.
Vox published an interesting post today called “America has stopped paying attention, but Ebola is still ravaging Sierra Leone”. It made me wonder whether it’s just America that has stopped paying attention, or if in fact other parts of the world have moved on as well. I turned to Google Trends to look at relative search popularity of the string ‘ebola’ over time for various countries. Here’s what I found:
A quick glance suggests that other unaffected countries have moved on as well. At least these nine have. More specifically, people in these other countries are also using Google to search for the English word ‘ebola’ far less frequently than they were earlier in the year, especially during early to mid October, when the hype hit its peak (except in India, where search was highest in early August).
One interesting country to look at is Japan. The dashboard above seems to indicate that no one in Japan cared at all about ebola, all year long. Is that true? No, it’s not true, and it highlights one of the limitations of using this type of data to answer this question. If you do a similar Google Trends search for エボラ, the Japanese word for ‘ebola’, here’s what you get:
It helps to understand exactly what your data is telling you, and what it isn’t telling you.
Ebola is the same word in English, Spanish, Portuguese, German, Italian, and Hindi, so the other countries probably don’t have a similar problem. I wasn’t able to find the Pashto or Dari (Afghanistan) translations of the word “ebola”, but suffice it to say that Google search trends are far less effective a proxy for popular interest in Afghanistan, where only 5.9% of the population uses the internet, according to The World Bank.
Finally, if we compare relative Google search popularity for ‘ebola’ in heavily affected countries like Liberia, Sierra Leone, and Guinea, here’s what we find:
- Get the raw data Excel file here
- Since Google Trends only allowed me to compare 5 countries at a time, I had to run two separate queries, with United States included in both queries to maintain a common reference point of comparison.
- After running the queries, I downloaded the data as a CSV by clicking on the gear icon in the top right corner of the Google Trends page.
- I combined both CSV downloads into one spreadsheet and used the Tableau Reshaper Excel Add-in (Windows only) to convert the resulting cross-tab table into a long list of data values – a single row for each week for each country.
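For anyone without the Excel add-in, the reshaping step in the last bullet can be sketched in a few lines of Python. This is not how the add-in works internally, just the same idea: one output row per week per country. The function name, the column layout (week first, then one column per country), and the values are assumptions for illustration.

```python
def crosstab_to_long(header, rows):
    """Turn [week, country1, country2, ...] rows into (week, country, value) records."""
    countries = header[1:]
    long_rows = []
    for row in rows:
        week, values = row[0], row[1:]
        for country, value in zip(countries, values):
            long_rows.append((week, country, value))
    return long_rows

# Made-up cross-tab for illustration; NOT actual Google Trends values.
header = ["Week", "United States", "Japan"]
rows = [
    ["2014-10-05", 80, 1],
    ["2014-10-12", 100, 2],
]
long_rows = crosstab_to_long(header, rows)

for record in long_rows:
    print(record)
```

The long format is what makes it easy to put country on Color or on Rows in Tableau, instead of dealing with one measure per country.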
As always, let me know if you have any thoughts about this topic, my approach to understanding it, or the visualization I created to communicate my findings.
Thanks for stopping by,
At the Tableau Public blog, we’ve chosen to focus on political data visualizations during the month of October, since election day in the United States is right around the corner. We’re using the hashtag #VizTheVote to collect our posts and to encourage others to share their thoughts on an aspect of our world that is rich with data (or at least should be) and is also ripe for visualization.
In this blog post, I’m going to show you how to take advantage of a seldom-used mapping feature in Tableau Public 8.2: built-in U.S. Congressional District shapes. First, let’s look at a viz showing the 113th House of Representatives by either age or tenure, then I’ll go into detail about how it was made:
Step 1: Get the Data
If you look at the Wikipedia page showing the “List of current members of the United States House of Representatives by age”, it looks like this:
I copied and pasted this table into Excel and added a column indicating which party each politician belongs to. Step 1 done.
Step 2: Structure the Data
This table is great, but notice the first column – “District”. It combines the state and the congressional district number into one geographic field. In order for Tableau to recognize the congressional district and apply the correct shape, these two fields need to be separated into a “District” column and a “State” column. I did this in Excel using “Text to Columns”. Here’s an image of the final Excel spreadsheet I used to build the viz:
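I did the split in Excel, but the same idea can be sketched in Python. The sketch assumes the combined field looks like “New York 12” or “Alaska At-large”, meaning the state name followed by a space and the district label; that format and the `split_district` helper are assumptions for illustration, so check them against the actual table.

```python
def split_district(combined):
    """Split a combined label such as 'New York 12' into (state, district).

    Splits on the LAST space so multi-word state names stay intact.
    """
    state, _, district = combined.strip().rpartition(" ")
    return state, district


# Hypothetical labels in the assumed "<State> <District>" format
for label in ["Alabama 1", "New York 12", "Alaska At-large"]:
    print(split_district(label))
```

Splitting on the last space rather than the first is the design choice that matters here: it keeps “New York” and “North Dakota” whole. At-large labels are passed through untouched and may need separate handling.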
Just the number itself suffices to automatically draw congressional district shapes in Tableau, but there are a few other variations that will also work, as shown in the Geographic Role table below:
Step 3: Visualize the Data
Connect a new Tableau workbook to this spreadsheet and make sure the geographic role for District is set to “Congressional District” (right click on the District pill). Then, do the following:
- Double click on “Latitude (generated)” (goes to Rows) and “Longitude (generated)” (goes to Columns)
- Drag both “District” and “State” to the Detail shelf
- Change the Marks type from Automatic to Filled Map
- Drag “Age” to the Color shelf (Age is a calculated field that uses DATEDIFF to compute the time between each member’s date of birth and today)
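To make the Age calculation concrete, here is a rough Python equivalent that counts completed years between a birth date and a reference date. The dates are invented and the `age_in_years` helper is hypothetical. Note that Tableau's DATEDIFF with a 'year' date part counts calendar-year boundaries crossed, so it can differ by one from the completed-years figure computed here.

```python
from datetime import date


def age_in_years(birth_date, as_of):
    """Return the number of completed years between birth_date and as_of."""
    years = as_of.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical member born on 20 October 1950
birth = date(1950, 10, 20)
print(age_in_years(birth, date(2014, 10, 15)))  # a few days before the birthday
print(age_in_years(birth, date(2014, 10, 20)))  # on the birthday
```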
Here’s an image of the map that shows the age of each member of the House of Representatives, with darker colors indicating older reps:
Step 4: Create the Dashboard
This last part is Tableau 101 and maybe a little bit of 201. I won’t explain here how to create the additional Sheets and combine them on a single Dashboard, since I cover that in Chapters 13 and 14 of my book Communicating Data with Tableau.
It’s interesting to notice a few things: congressional districts are split on a coastal vs land-locked basis, and some members of the House are quite old and have hung on to their seats for upwards of 5 or 6 decades (John Dingell, Michigan 12). Mostly, though, I hope you notice that creating choropleths of congressional districts in Tableau is quite easy.
For more data at a Congressional District level, check out the U.S. Census Bureau “American Fact Finder” table, or use this CSV I downloaded from the Census site that is ready to import directly into Tableau.
Thanks for stopping by,
I’m a big fan of small multiples in data viz, and I’m somewhat of a “Maphead” as well. Naturally, combining the two together results in a visualization that I’d vouch for almost any time. Kyle Kim of the LA Times just published a stunning series of 192 maps showing drought levels in California by week, going back to January 4, 2011. Small Multiple Maps can take up a lot of space, but they’re very effective at showing change over both time and geography. Judge for yourself.
I look at a lot of Tableau Public visualizations, but I don’t see a lot of “small multiple maps” out there. It’s not that they don’t exist, they’re just rare. They’re actually pretty easy to make, so I thought I’d show you one and walk you through how to create one for yourself. Here’s a small multiples map showing FEMA declared disasters, by county, since 1953:
If you want to follow this brief tutorial, first download this Excel file of FEMA disasters.
How to Make a Small Multiples Map in Tableau
There are at least two different ways you can create small multiples maps in Tableau. One way is to create a bunch of individual maps as Sheets and drag and drop them all onto a single Dashboard. The other way is to create a single Sheet with a grid of small maps. This blog post covers the second method, which has the advantage that the “OpenStreetMap” attribution appears only once, in the bottom left corner, instead of once for each multiple.
Step 1: Create a basic map
I started by creating a basic choropleth map of continental US counties. I double clicked on the county data field (Declared County/Area) and then dragged “Number of Records” to the Color Shelf. I filtered out the states and territories not in the “lower 48”, changed the Color to red, set county shape borders to “None”, and edited the Map Options to show only the coastline and borders:
Step 2: Create a “Row Number” and “Column Number” Calculated Field
There are 22 different “Incident Types” (so, plenty of material for Hollywood), but for this project I wanted to create a 3X3 grid, so I needed to identify the top 9 Incident Types. From a simple bar chart showing counts of Incident Type over the full date range, I found that (in descending order of frequency) Severe Storm(s), Flood, Hurricane, Snow, Fire, Severe Ice Storm, Tornado, Drought and Coastal Storm were the ones to include.
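I found the top 9 with a bar chart in Tableau, but the same ranking can be sketched in Python by counting how often each Incident Type appears. The sample records and the `top_incident_types` helper are hypothetical and only show the shape of the calculation.

```python
from collections import Counter


def top_incident_types(incident_types, n=9):
    """Return the n most frequent incident types, most frequent first."""
    counts = Counter(incident_types)
    return [name for name, _ in counts.most_common(n)]


# Invented sample: one entry per declared disaster record
sample = ["Flood", "Flood", "Flood", "Fire", "Fire", "Tornado"]
print(top_incident_types(sample, n=2))
```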
I wanted to put each of the 9 top Incident Types in its own box on the 3X3 grid starting with the least frequent type of the 9 (Coastal Storm) in the top left and working my way down to the most frequent (Severe Storm) in the bottom right. Each of the nine then would have a Row Number (1-3) and a Column Number (1-3). I created two new Calculated Fields (right click in the Dimensions or Measures area and select “Create Calculated Field”) to place each in its proper location:
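The placement logic can also be sketched outside Tableau. The Python below takes the nine Incident Types in descending order of frequency, reverses them, and fills a 3X3 grid left to right and then top to bottom. The fill order within the grid and the `grid_positions` helper are assumptions for illustration; in Tableau the equivalent would be two calculated fields that return a row and a column for each Incident Type.

```python
# The nine types from the bar chart, most frequent first
TOP_TYPES_DESCENDING = [
    "Severe Storm(s)", "Flood", "Hurricane", "Snow", "Fire",
    "Severe Ice Storm", "Tornado", "Drought", "Coastal Storm",
]


def grid_positions(types_descending, columns=3):
    """Map each type to a 1-based (row, column), least frequent first."""
    positions = {}
    for index, name in enumerate(reversed(types_descending)):
        row, column = divmod(index, columns)  # row-major fill (assumed)
        positions[name] = (row + 1, column + 1)
    return positions


positions = grid_positions(TOP_TYPES_DESCENDING)
for name, (row, column) in positions.items():
    print(f"{name}: row {row}, column {column}")
```

With this ordering, Coastal Storm lands in row 1, column 1 (top left) and Severe Storm(s) in row 3, column 3 (bottom right).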
Step 3: Use “Row Number” and “Column Number” to create the grid
Now that the grid location fields are created, I just needed to drag “Row Number” to the Rows shelf and “Column Number” to the Columns shelf, and change both from SUM to a Dimension. When I used a Quick Filter to include only the top 9 Incident Types, I had my small multiples view:
Step 4: Formatting
The rest is mostly clean-up: hiding the Row and Column Headers, customizing the Tooltips, adding a date Quick Filter, and placing the small multiples map on a Dashboard. In the Dashboard, the titles for the 9 boxes are actually 9 very similar Sheets, each with Incident Type and Number of Records added as Text and filtered to just one of the nine incident types.
What do you think? Easy to make, right? Pretty effective as well, wouldn’t you say?
I’d love to hear your thoughts, and thanks for stopping by,
PS. Coastal Storms seem to be occurring in rather… non-coastal areas of the country. I’m not entirely sure why, but I’m guessing it’s a misclassification by FEMA. If anyone knows the story, I’d love to hear it.