
Sunset Color Palettes

2015 March 20
by Ben Jones

Earlier this week I was visiting family in Ventura County, California. I had a nice view of the sunset one evening, and I noticed how much the color palette of the sky changed in seven minute increments. I used an iPhone to snap the pictures, Instant Eyedropper for Windows to pull out the hex codes of the uploaded images, and the website to create the color squares beneath. Enjoy!


Tapestry 2015: Seven Data Story Types

2015 March 4
by Ben Jones

Here is my 2015 Tapestry Conference presentation using Freedom of the Press data (Excel, source). You can follow the live feed at 10:30am ET on March 4th to hear the presentation.

Avoiding Data Pitfalls, Part 2: Fooled by Small Samples

2015 January 22
by Ben Jones

Previous – “Part 1: Gaps Between Data and Reality”

I’m reading “Thinking, Fast and Slow” by Nobel Prize winner Daniel Kahneman – a frighteningly interesting book about cognitive biases and “heuristics” (rules of thumb) in decision making. If you deal with numbers at all and haven’t read it yet, you should. In it he refers to an article by Howard Wainer and Harris L. Zwerling called “Evidence That Smaller Schools Do Not Improve Student Achievement” that talks about kidney cancer rates, of all things.

Kidney cancer is a relatively rare form of cancer, accounting for only ~3% of all adult cancers. If you look at kidney cancer rates by county in the U.S. an interesting pattern emerges, as he describes on page 109 of his book:

“The counties in which the incidence of kidney cancer is lowest are mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.” (p. 109)

What do you make of this? He goes on to list some of the reasons people come up with in an attempt to rationalize this fact: residents of rural counties have access to fresh food, lack of air pollution, etc. Did these explanations come to your mind, too? He then points out the following:

“Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.”

Again, people come up with various theories to explain this fact: rural counties have relatively high poverty rates, high fat diet, lack of access to medication, etc.

But wait – what’s going on here? Rural counties have both the highest and the lowest kidney cancer rates? What gives?

Insensitivity to Sample Size

This is a great example of a bias known as “insensitivity to sample size”. It goes like this: when we deal with data, we don’t take into account sample size when we think about probability. These rural counties have relatively few people, and as such, they are more likely to have either very high or very low incidence rates. Why? Because the variance of the mean is inversely proportional to the sample size. The smaller the sample, the greater the variance (proof).
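To spell out that last step, here is the standard textbook derivation, under the simplifying assumption that a county's rate is the mean of n independent observations X_1, ..., X_n that each have the same variance σ² (real residents are not perfectly independent or identical, so treat this as a sketch):

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n},
\qquad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}
```

So a county with 100 times fewer people has a rate whose typical swing around the true value is about 10 times larger.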

I found the 2007-2011 kidney cancer rate data and the 2010 population data for each U.S. county, and created this interactive graphic to illustrate the point that Kahneman, Wainer and Zwerling are trying to make:

Notice a few things in the dashboard above:

  1. In the choropleth map, the darkest orange (high rates relative to the overall U.S. rate) and the darkest blue (low rates relative to the overall U.S. rate) counties are often right next to each other
  2. In the scatterplot below the map, the marks form a funnel shape, with less populous counties (to the left) more likely to deviate from the reference line (the overall US rate), and more populous counties like Chicago, L.A. and New York more likely to be close to the overall reference line
  3. If you hover over a county with a small population, you will notice that the average number of cases per year is extremely low – 4 cases or less sometimes. A small deviation – even just 1 or 2 cases – in a subsequent year will shoot a county from the bottom of the list to the top, or vice versa
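The same pattern can be reproduced without any real data. Below is a minimal simulation sketch in Python in which every county has exactly the same underlying risk and only the population differs. The rate of 16 cases per 100,000 people and the county sizes are made-up illustrative values, not figures from the dashboard, and `simulate_county_rates` is a hypothetical helper name.

```python
import random


def simulate_county_rates(populations, true_rate, seed=0):
    """Return (population, observed rate per 100,000) for each county.

    Every county shares the same true_rate; only the population differs,
    so any spread in the observed rates is pure chance.
    """
    rng = random.Random(seed)
    results = []
    for population in populations:
        # Each resident independently becomes a case with probability true_rate.
        cases = sum(1 for _ in range(population) if rng.random() < true_rate)
        results.append((population, 100_000 * cases / population))
    return results


# Hypothetical mix: 50 small counties and 20 large ones, identical risk.
counties = [2_000] * 50 + [100_000] * 20
rates = simulate_county_rates(counties, true_rate=16 / 100_000, seed=1)

lowest = min(rates, key=lambda r: r[1])
highest = max(rates, key=lambda r: r[1])
print("lowest rate is in a county of size", lowest[0])
print("highest rate is in a county of size", highest[0])
```

With these settings a small county usually records zero cases (a rate of 0), while a single case in a county of 2,000 people already works out to 50 per 100,000, so both ends of the ranking are filled by small counties even though nothing about the counties differs except size.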

Other Examples

Where else does “insensitivity to sample size” come up? My colleague Dash Davidson suggested “streaks” in sports, which can often be just a “clustering illusion”. We look at a brief sample of a player’s overall performance and notice temporary periods of greatness. We should expect to see such streaks for even mediocre players. Remember Linsanity? Similarly, small samples make some rich and others poor in the world of gambling. You may have a good day at the tables, but if you keep playing, eventually the house will win. And in investing, “diversification” is nothing more than a strategy to minimize exposure to extreme downside risks of individual securities (think Enron).

Kahneman and his long-time partner Amos Tversky even showed that 84 professional psychologists were subject to this very same bias, so experts are not immune.

Avoiding this Pitfall

So what do we do about it? How do we make sure we don’t fall into the pitfall known as “insensitivity to sample size”?

  • Be aware of any sampling involved in the data we are analyzing
  • Understand that the smaller the sample size, the more likely we will see a rate or statistic that deviates significantly from the population
  • Before forming theories about why a particular sample deviates from the population in some way, first consider that it may just be noise and chance
  • Visualize the rate or statistic associated with groups of varying size in a scatterplot. If you see the telltale funnel shape, then you know not to be fooled
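To make the funnel in the last suggestion concrete, here is a small sketch that computes the band in which a group's rate is expected to fall if it shares the overall rate. It assumes a normal approximation to the binomial, which gets rough when the expected number of cases is very small, and `funnel_limits` is a hypothetical helper, not something from the dashboard.

```python
import math


def funnel_limits(overall_rate, population, z=1.96):
    """Approximate lower and upper limits for a group's observed rate.

    overall_rate is a proportion (e.g. 0.00016 for 16 per 100,000).
    The band narrows as 1 / sqrt(population), which is what draws the funnel.
    """
    standard_error = math.sqrt(overall_rate * (1 - overall_rate) / population)
    lower = max(0.0, overall_rate - z * standard_error)
    upper = overall_rate + z * standard_error
    return lower, upper


# Illustrative values only: the same overall rate, three group sizes.
for size in (2_000, 200_000, 2_000_000):
    low, high = funnel_limits(0.00016, size)
    print(size, round(low * 100_000, 1), round(high * 100_000, 1))
```

Points that sit inside the band are consistent with noise; points well outside it for a large group are the ones that deserve a theory.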

In Conclusion

The point of the original article by Wainer and Zwerling is that smaller schools are apt to yield extreme test scores by virtue of the fact that there aren’t enough students in small schools to “even out” the scores. A random cluster of extremely good (or bad) performers can sway a small school’s scores. At a very big school, yes a few bad results will still affect the overall mean, but not nearly as much.

Here’s another way to think of it: if Daniel Kahneman ever moved to Lost Springs, Wyoming, then half of the town’s population would be Nobel Prize winners. And if you think that moving there would increase your chances of winning the Nobel Prize, or that it’s “in the water” or some other such reason, then you’re suffering from a severe case of insensitivity to sample size.

Do any other examples of this pitfall come to mind? Ever fall into it yourself? Share by leaving a comment below.

Thanks for stopping by,

Avoiding Data Pitfalls, Part 1: Gaps Between Data and Reality

2015 January 6
by Ben Jones

Happy New Year! In 2015 I’ll be publishing a periodic series of blog posts entitled “Avoiding Data Pitfalls” where I’ll suggest ways to avoid common errors people make when working with data. The pitfalls range from philosophical to technical, and from analytical to visual. I’m familiar with these pitfalls because I’ve fallen into them myself, some of them repeatedly. If I’m the only one that these posts keep out of trouble, then it’ll be worth it.

We fall head first into a pitfall when we fail to remember that a gap exists between our data and reality. Do people really fail to realize this? I see (and make) this mistake quite often. I’m starting with this one because it’s foundational, dealing with the grounds and limits of our knowledge. How does it work?

It works like this: we get some data, and run with it, never stopping to think about where it came from, who collected it, what it tells us, and, importantly, what it doesn’t tell us.

It’s easy when working with data to treat it as reality rather than data collected about reality. Here are some examples:

  • It’s not crime, it’s reported crime.
  • It’s not the number of meteor strikes, it’s the number of recorded meteor strikes.
  • It’s not the outer diameter of a mechanical part, it’s the measured outer diameter.
  • It’s not how the public feels about a controversial topic, it’s how survey respondents are willing to say they feel.
  • It’s not how many people suffer from a particular disease, it’s how many times a doctor diagnosed people with a particular disease.

You get the picture. This distinction may seem like a technicality, and sometimes it is (the number of home runs Hank Aaron “reportedly” hit?) but it can also be a big deal. Let’s see an example of how missing it can lead us astray:

Example #1: Actual vs. Recorded Earthquakes

Consider earthquakes. The USGS provides an Earthquake Archive Search form that lets visitors obtain a list of historical earthquakes that meet various criteria. A query of earthquakes of magnitude 6.0 and above from 1900 to 2013 yields the following, somewhat alarming, line plot:

Are we really to believe that earthquakes have increased in frequency by this much? Obviously not. The world that measured and collected earthquakes in the early 20th century was very different than the one that did so in the last decade. Comparisons across decades, and even within some decades (the 1960s), aren’t “apples-to-apples” due to the changes in technology.

If we separate the line plot by magnitude and add annotations that describe advances in seismology, we see that the rise is only in the smaller group (magnitude 6.0-6.9), and coincides with dramatic improvements in instrumentation:

It’s safe to say that the rise in recorded earthquakes is primarily due to the improvements in our ability to detect them. There may also be an upward trend in actual earthquakes over this time, but our data doesn’t reveal it to us due to the continual changes in the quality of the measurement system. When it comes to earthquakes, the gap between data and reality is getting smaller. The problem is that the “data-reality gap” is changing over the time period we’re considering. And it’s hard to know for sure exactly how many magnitude 6.0 earthquakes we missed in any particular year.
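Separating the counts by magnitude is easy to script once the archive is downloaded. Here is a minimal sketch, assuming the records have been reduced to (year, magnitude) pairs. The two band labels are my own shorthand for the split described above, and the sample rows are invented for illustration, not USGS data.

```python
from collections import Counter


def count_by_decade_and_band(quakes):
    """Count earthquakes per (decade, magnitude band).

    quakes is an iterable of (year, magnitude) pairs for magnitude >= 6.0.
    """
    counts = Counter()
    for year, magnitude in quakes:
        decade = (year // 10) * 10
        band = "6.0-6.9" if magnitude < 7.0 else "7.0+"
        counts[(decade, band)] += 1
    return counts


# Invented sample rows, only to show the shape of the input and output.
sample = [(1905, 7.8), (1962, 6.1), (1964, 6.4), (1968, 7.2), (2011, 6.3)]
for key, n in sorted(count_by_decade_and_band(sample).items()):
    print(key, n)
```

Plotting one line per band makes it much easier to see whether a rise is spread across all magnitudes or concentrated in the ones that older instruments were most likely to miss.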

Let’s look at another example – counting bicycles that cross a bridge.

Example #2: Counting Bicycles

Every day on my way to work I walk across the Fremont Bridge. It’s a bright blue and orange double-leaf bascule bridge that was built in 1917. Since it sits so close to the water, it opens on average 35 times a day, which supposedly makes it the most opened drawbridge in the United States. Here’s what it looks like:


The City of Seattle Department of Transportation has installed two “inductive loops” on the pedestrian/bicycle pathways of the bridge that are designed to count the number of bicycles that cross the bridge in either direction, all day every day. The city also provides hourly counts going back to October 2, 2012. Downloading this data and visualizing it yields the following timeline:

I showed this data at a recent luncheon of the Puget Sound Research Forum, and asked what the attendees thought of these spikes. I honestly didn’t know what had caused them. A few ideas sprang from the crowd – was it “bike to work day”, really good weather, or maybe there was some big bike race or club event? Notice how each of these ideas is based on the assumption that there actually were more bikes that crossed the bridge on those days.

David Bauer was in the audience and found the answer for us: equipment error. The counters just glitched for a few hours on both days. You can read all the details of these anomalous readings and the correspondence between a local blogger and a city employee at the Seattle Bike Blog. I won’t repeat it here, but suffice it to say measuring things can be tricky. Turns out a low battery was the culprit.
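Spikes like these can also be flagged automatically before anyone starts theorizing about them. The sketch below uses a robust rule (distance from the median, measured in median absolute deviations). The threshold of 5 and the sample counts are arbitrary illustrative choices, and `flag_spikes` is a hypothetical helper rather than anything the city uses.

```python
import statistics


def flag_spikes(counts, threshold=5.0):
    """Return the indices of values far above the typical level.

    A value is flagged when it exceeds the median by more than
    `threshold` median absolute deviations (MAD).
    """
    median = statistics.median(counts)
    mad = statistics.median(abs(c - median) for c in counts)
    if mad == 0:
        return []  # no spread at all, so nothing can be judged unusual
    return [i for i, c in enumerate(counts) if (c - median) / mad > threshold]


# Invented counts for the same hour on eight days; one reading is suspicious.
hourly = [120, 135, 128, 140, 131, 2900, 125, 138]
print(flag_spikes(hourly))
```

A flag only says “look here”. Whether the cause is a bike race or a low battery still takes a phone call or an email to the people who run the counter.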

Let’s consider one last example: counting Ebola deaths.

Example #3: Ebola Deaths

This past year, the whole world watched in horror as Ebola ravaged West Africa. It’s still happening, by the way, we’ve just stopped noticing. In any case, the WHO provides data about fatalities in weekly situation reports. I had an interesting discussion on Twitter with Alex McDonnell about this data. In it he referred to errors in the WHO reports.

Errors? About one of the world’s most closely followed topics? From one of the world’s most respected organizations? You bet.

Let’s take a look at a timeline of cumulative deaths from Ebola as reported by WHO and CDC. Notice the drops in cumulative death counts – the handful of times when the lines slope down:

Of course it makes perfect sense: the task of diagnosing disease and ascertaining causes of death in some of the more remote locations, where the equipment and staff are often severely limited, must be incredibly difficult. If you read the WHO situation report you’ll notice that they classify cases as “confirmed”, “probable” and “suspected”. It’s not always so obvious. Here are the criteria:


The WHO and CDC actually do a very good job of speaking clearly about “reported” cases (the Dec 31st WHO situation report includes the word “reported” no less than 61 times).

I don’t bring up this example to criticize the people or organizations involved with fighting and documenting the Ebola outbreak. Far from it. This example merely demonstrates that the gap between data and reality can exist even when the stakes are high. Classifying diseases and deaths in chaotic conditions can be tricky business indeed.
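A cumulative count that goes down is something a few lines of code can catch before the chart is even drawn. Here is a minimal sketch, with invented report labels and numbers (not WHO or CDC figures) and a hypothetical helper name:

```python
def find_cumulative_drops(series):
    """Find places where a cumulative series decreases.

    series is a chronological list of (label, cumulative_count) pairs.
    Returns (previous_label, label, change) for every decrease.
    """
    drops = []
    for (prev_label, prev_count), (label, count) in zip(series, series[1:]):
        if count < prev_count:
            drops.append((prev_label, label, count - prev_count))
    return drops


# Invented values: the third report revises the running total downward.
reports = [("report 1", 100), ("report 2", 140), ("report 3", 132), ("report 4", 150)]
print(find_cumulative_drops(reports))
```

Each drop it finds is a revision of earlier numbers, which is a useful reminder that the totals on either side of it were estimates too.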

How to Avoid Confusing Data with Reality

Notice that in these three examples – 1. earthquakes (a dubious trend), 2. bicycle counting (a spike or outlier), and 3. Ebola deaths (a downward slope in a cumulative line plot) – something in the view of the data itself alerted us to a potential “data-reality gap”. Visualizing the data can be one of the best ways to find problems with it.

Earlier in the game, though, it helps to remind ourselves that every data point that exists was collected, stored, accessed, etc., via imperfect processes. The more we know about these processes – the equipment used, the protocol followed, the people involved, the steps they took, their motivations – the better equipped we will be to assess the “data-reality gap”.

Here are seven suggestions to help you avoid confusing data with reality:

  1. Clearly understand the operational definitions of all metrics
  2. Draw the data collection steps as a process flow diagram
  3. Understand the limitations and inaccuracies of each step in the process
  4. Identify any changes in method or equipment over time
  5. Seek to understand the motives of the people collecting and reporting. Could there be any biases or incentives involved?
  6. Visualize the data and investigate any shifts, outliers and trends for possible discrepancies
  7. Think carefully about data formatting, processing, and transformations (thanks Keith!)

Ultimately, each data collection activity is unique, and there are too many possible sources of error to list them all. These are some typical ones that I’ve come in contact with, and you may have your own suggestions. I’d love to hear them.

In Conclusion

At the core of this first “data pitfall” is our attitude toward data. Do we arrogantly or naively see ourselves as experts on a topic as soon as we get our hands on some data, or do we humbly realize that our knowledge is imperfect, and we may not know the full story?

We can’t ever perfectly know the “data-reality gap” because that would require perfect data. What we can do, though, is seek to identify any gaps that may exist, and take that into account when we use data to form our opinions.

Thanks for reading,

Next: Part 2: Fooled by Small Samples

Kobe Vs. Jordan: what defines a “career”?

2014 December 16
by Ben Jones

Since Kobe Bryant surpassed Michael Jordan in “career points” scored this past Sunday, much has been written about who is better, etc. If a) you’ve heard enough already, or b) you don’t care about sports at all, then you may stop reading now. There is a finer point about “data definitions”, but I’ll leave that for later. First, back to basketball player egos:

There is one thing about this debate that puzzles me: why regular season stats alone are used in the “career” totals. I can understand leaving out the All-Star game stats, as the All-Star game is largely a meaningless pick-up game (and crazy points-fest). But playoffs? NBA basketball is all about the playoffs. That’s where the games really matter, the real drama unfolds, and Hall-of-Fame “careers” are made.

Bryant and Jordan have both had considerable success in the playoffs, but somehow the points they scored during these crucial games don’t count to their career totals? I don’t get it. If there’s a compelling reason to leave out playoff stats from “career totals”, I’m unaware of it.

And if you take into consideration playoff point totals when tallying career stats, Bryant hasn’t passed Jordan quite yet:

Of course you can make the argument that this just postpones the inevitable by a few weeks, as Bryant will soon surpass Jordan in total points scored including the playoffs. So who cares?

To me, it just shows how we can get wrapped up in debates about numbers without stopping to consider the “data definitions” – exactly what are we comparing? How is this data collected, what is included and what is not included? Does it even make sense?

Did you know all those dramatic game-winning shots these two players made in crunch time aren’t even included in their career totals?

Like this one:

And these ones:

I bet you didn’t. If you did, can you defend it? I can’t.

At some point in the next month or so, Kobe will pass Jordan in the amount of total points he has scored in both the regular season and the playoffs combined. There will be no hugging at center court, no interviews, no headlines or blog posts about the stats. That’s not such a bad thing, though, I guess.

As for who is the better scorer? Jordan scored at a higher rate, but took two extended breaks during his career. Bryant took no such breaks, and only had one considerable setback to injury, so he racked up points at a slower, albeit unabated, pace. And for all the flak Bryant has taken for being a ball hog, try comparing his career Assists with Jordan’s. He passed Jordan in assists two years ago.

Details, details…

Thanks for indulging me,

Free Sampler of Communicating Data with Tableau

2014 December 8
by Ben Jones

In June of this year I published my first book with O’Reilly Media called “Communicating Data with Tableau”. It has been great to hear from readers around the world, and I’m grateful for the reviews that have been published.

Here is a sampler that includes the entirety of Chapter 1 (entitled “Communicating Data”) in pdf format, freely available for anyone to download and share (click to download the pdf):

A word about Chapter 1: This was my favorite chapter to write, and it was also the most difficult chapter to write. You should know that it’s very different from the other 13 chapters in the book. I call it my “ideas chapter”. The others are far more practical and technical.

In Chapter 1, I share my thoughts on the concept of creating and sharing data visualizations as a particular form of the communication process. Thinking of it that way has been very helpful for me, because it frames the activity as an attempt to affect the minds of others.

Just like other types of communication such as speech and even body language, its success depends on many factors, some obvious and some subtle. It is subject to problems of various kinds, as identified by the fathers of information theory, Shannon and Weaver. And critically, communicating data touches on both the rational and the emotional.

Ultimately, I outline 6 Principles of Communicating Data, for which I have also created a handy online checklist. These principles have helped me focus my efforts and avoid some common pitfalls that I have fallen into in the past, such as failing to identify and know my audience, or only considering a subset of the relevant data.

Thanks for visiting, and please be sure to send me a note to tell me what you think about these resources, and what you would add/edit/remove.


Visualizing the Ebola Scare Using Google Trends & Tableau

2014 December 2
by Ben Jones

Vox published an interesting post today called “America has stopped paying attention, but Ebola is still ravaging Sierra Leone”. It made me wonder whether it’s just America that has stopped paying attention, or if in fact other parts of the world have moved on as well. I turned to Google Trends to look at relative search popularity of the string ‘ebola’ over time for various countries. Here’s what I found:

A quick glance suggests that other unaffected countries have moved on as well. At least these nine have. More specifically, people in these other countries are also using Google to search for the English word ‘ebola’ far less frequently than they were earlier in the year, especially during early to mid October, when the hype hit its peak (except in India, where search was highest in early August).

One interesting country to look at is Japan. The dashboard above seems to indicate that no one in Japan cared at all about ebola, all year long. Is that true? No, it’s not true, and it highlights one of the limitations of using this type of data to answer this question. If you do a similar Google Trends search for エボラ, the Japanese word for ‘ebola’, here’s what you get:

It helps to understand exactly what your data is telling you, and what it isn’t telling you.

Ebola is the same word in English, Spanish, Portuguese, German, Italian, and Hindi, so the other countries probably don’t have a similar problem. I wasn’t able to find the Pashto or Dari (Afghanistan) translations of the word “ebola”, but suffice it to say that Google search trends are far less effective a proxy for popular interest in Afghanistan, where only 5.9% of the population uses the internet, according to The World Bank.

Finally, if we compare relative Google search popularity for ‘ebola’ in heavily affected countries like Liberia, Sierra Leone, and Guinea, here’s what we find:

Project Notes:

  • Get the raw data Excel file here
  • Since Google Trends only allowed me to compare 5 countries at a time, I had to run two separate queries, with United States included in both queries to maintain a common reference point of comparison.
  • After running the queries, I downloaded the data as a CSV by clicking on the gear icon in the top right corner of the Google Trends page.
  • I combined both CSV downloads into one spreadsheet and used the Tableau Reshaper Excel Add-in (Windows only) to convert the resulting cross-tab table into a long list of data values – a single row for each week for each country.
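If you don’t have the Excel add-in handy, the cross-tab to long-list reshaping in the last note can be sketched in a few lines of Python. The column names (“Week”, “Country”, “Search Interest”) and the sample values are assumptions for illustration; the real download may be laid out differently.

```python
import csv
import io


def crosstab_to_long(csv_text, id_column="Week"):
    """Turn one row per week with one column per country
    into one row per week per country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    long_rows = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], "Country": column, "Search Interest": int(value)}
            )
    return long_rows


# Invented sample in the assumed cross-tab layout.
sample = "Week,United States,Japan\n2014-10-05,80,1\n2014-10-12,100,2\n"
for record in crosstab_to_long(sample):
    print(record)
```

The long layout is what lets Tableau treat country as a single dimension instead of a separate measure per country.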

As always, let me know if you have any thoughts about this topic, my approach to understanding it, or the visualization I created to communicate my findings.

Thanks for stopping by,

U.S. Congressional District choropleths made easy

2014 October 20
by Ben Jones

At the Tableau Public blog, we’ve chosen to focus on political data visualizations during the month of October, since election day in the United States is right around the corner. We’re using the hashtag #VizTheVote to collect our posts and to encourage others to share their thoughts on an aspect of our world that is rich with data (or at least should be) and is also ripe for visualization.

In this blog post, I’m going to show you how to take advantage of a seldom-used mapping feature in Tableau Public 8.2: built-in U.S. Congressional District shapes. First, let’s look at a viz showing the 113th House of Representatives by either age or tenure, then I’ll go into detail about how it was made:

Step 1: Get the Data
If you look at the Wikipedia page showing the “List of current members of the United States House of Representatives by age“, it looks like this:


I copied and pasted this table into Excel and added a column indicating which party each politician belongs to. Step 1 done.

Step 2: Structure the Data
This table is great, but notice the first column – “District”. It combines the state and the congressional district number into one geographic field. In order for Tableau to recognize the congressional district and apply the correct shape, these two fields need to be separated into a “District” column and a “State” column. I did this in Excel using “Text to Columns”. Here’s an image of the final Excel spreadsheet I used to build the viz:
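For anyone who would rather script the split than use “Text to Columns”, here is a minimal sketch. It assumes each combined value is the state name followed by a space and the district designation (for example “New York 13”), which is my reading of the table layout rather than something verified against the page, and `split_district` is a hypothetical helper.

```python
def split_district(label):
    """Split a combined value such as 'New York 13' into (state, district).

    Splits on the last space only, so multi-word state names stay intact.
    """
    state, _, district = label.rpartition(" ")
    return state, district


for value in ("Alabama 1", "New York 13", "North Carolina 4"):
    print(split_district(value))
```

Splitting on the last space rather than the first is the one design choice that matters here, since a plain left-to-right split would break “New York” and “North Carolina” apart.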


Just the number itself suffices to automatically draw congressional district shapes in Tableau, but there are a few other variations that will also work, as shown in the Geographic Role table below:


Step 3: Visualize the Data
Connect a new Tableau workbook to this spreadsheet and make sure the geographic role for District is set to “Congressional District” (right click on the District pill). Then, do the following:

  1. Double click on “Latitude (generated)” (goes to Rows) and “Longitude (generated)” (goes to Columns)
  2. Drag both “District” and “State” to the Detail shelf
  3. Change the Marks type from Automatic to Filled Map
  4. Drag “Age” to the Color shelf (Age is a calculated field calculating the DATEDIFF between today and the date of birth)
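One detail worth checking in an age calculation like this is what “difference in years” means. As far as I understand it, a year-level DATEDIFF counts calendar-year boundaries crossed, which can be one more than the number of birthdays someone has actually had. The Python sketch below shows both interpretations with a made-up birth date; the function names are hypothetical and this is not the workbook’s actual formula.

```python
from datetime import date


def year_boundaries_crossed(start, end):
    """Difference between the calendar years, ignoring month and day."""
    return end.year - start.year


def completed_years(birth, today):
    """Age in whole years: subtract one if the birthday hasn't happened yet this year."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


birth = date(1950, 12, 1)   # made-up date of birth
today = date(2014, 10, 20)  # the date of this post
print(year_boundaries_crossed(birth, today), completed_years(birth, today))
```

For a color scale spanning several decades the one-year difference hardly shows, but it matters if the ages end up in a tooltip or a label.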

Here’s an image of the map that shows the age of each member of the House of Representatives, with darker colors indicating older reps:


Step 4: Create the Dashboard
This last part is Tableau 101 and maybe a little bit of 201. I won’t go into detail about how to create the additional Sheets and combine them on a single Dashboard, as I go into detail on how to do this in Chapters 13 and 14 of my book Communicating Data with Tableau.

Interesting to notice a few things: that congressional districts are split on a coastal vs land-locked basis, that some members of the House are quite old and have hung on to their seat upwards of 5 or 6 decades (John Dingell, Michigan 12). Mostly, though, I hope you notice that creating choropleths of congressional districts in Tableau is quite easy.

For more data at a Congressional District level, check out the U.S. Census Bureau “American Fact Finder” table, or use this CSV I downloaded from the Census site that is ready to import directly into Tableau.

Thanks for stopping by,

How to Make Small Multiple Maps in Tableau

2014 September 6
by Ben Jones

I’m a big fan of small multiples in data viz, and I’m somewhat of a “Maphead” as well. Naturally, combining the two together results in a visualization that I’d vouch for almost any time. Kyle Kim of the LA Times just published a stunning series of 192 maps showing drought levels in California by week, going back to January 4, 2011. Small Multiple Maps can take up a lot of space, but they’re very effective at showing change over both time and geography. Judge for yourself.

I look at a lot of Tableau Public visualizations, but I don’t see a lot of “small multiple maps” out there. It’s not that they don’t exist, they’re just rare. They’re actually pretty easy to make, so I thought I’d show you one and walk you through how to create one for yourself. Here’s a small multiples map showing FEMA declared disasters, by county, since 1953:

If you want to follow this brief tutorial, first download this Excel file of FEMA disasters.

How to Make a Small Multiples Map in Tableau

There are at least two different ways you can create small multiples maps in Tableau. One way is to create a bunch of individual maps as Sheets and drag and drop them all onto a single Dashboard. The other way is to create a single Sheet with a grid of small maps. This blog post covers the second method, which has the advantage that the “OpenStreetMaps” attribution only occurs once in the bottom left corner, instead of once for each multiple.

Step 1: Create a basic map

I started by creating a basic choropleth map of continental US counties. I double clicked on the county data field (Declared County/Area) and then dragged “Number of Records” to the Color Shelf. I filtered out the states and territories not in the “lower 48”, changed the Color to red, set county shape borders to “None”, and edited the Map Options to only show the coastline and borders:

Step 2: Create a “Row Number” and “Column Number” Calculated Field

There are 22 different “Incident Types” (so, plenty of material for Hollywood), but for this project I wanted to create a 3X3 grid, so I needed to identify the top 9 Incident Types. From a simple bar chart showing counts of Incident Type over the full date range, I found that (in descending order of frequency) Severe Storm(s), Flood, Hurricane, Snow, Fire, Severe Ice Storm, Tornado, Drought and Coastal Storm were the ones to include.

I wanted to put each of the 9 top Incident Types in its own box on the 3X3 grid starting with the least frequent type of the 9 (Coastal Storm) in the top left and working my way down to the most frequent (Severe Storm) in the bottom right. Each of the nine then would have a Row Number (1-3) and a Column Number (1-3). I created two new Calculated Fields (right click in the Dimensions or Measures area and select “Create Calculated Field”) to place each in its proper location:
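The logic behind the two fields is just integer division and remainder on each type’s position in the display order. Here it is sketched in Python rather than Tableau’s calculation language, assuming the grid is filled left to right and then top to bottom; `grid_position` is a hypothetical name and this is not a copy of the actual calculated fields.

```python
def grid_position(rank, columns=3):
    """Map a 1-based display rank to (row number, column number)."""
    index = rank - 1
    return index // columns + 1, index % columns + 1


# Display order from the post: least frequent of the nine first.
display_order = [
    "Coastal Storm", "Drought", "Tornado",
    "Severe Ice Storm", "Fire", "Snow",
    "Hurricane", "Flood", "Severe Storm(s)",
]
for rank, incident_type in enumerate(display_order, start=1):
    print(incident_type, grid_position(rank))
```

In Tableau the same mapping can be written out case by case for the nine types, which is more typing but keeps the calculation readable.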

Step 3: Use “Row Number” and “Column Number” to create the grid

Now that the grid location fields are created, I just needed to drag “Row Number” to the Row Shelf and “Column Number” to the Column Shelf, and change both from SUM to a Dimension. When I used a Quick Filter to only include the 9 top fields, I had my small multiples view:

Step 4: Formatting

The rest is mostly clean-up, really. Hiding the Row and Column Headers, customizing the Tooltips, adding a date Quick Filter, and placing the small multiples map on a Dashboard. In the Dashboard, the titles for the 9 boxes are actually 9 very similar Sheets with Incident Type and Number of Records added as Text and filtered to just one of the nine incident types.

What do you think? Easy to make, right? Pretty effective as well, wouldn’t you say?

I’d love to hear your thoughts, and thanks for stopping by,

PS. Coastal Storms seem to be occurring in rather… non-coastal areas in the country. Not entirely sure why, but I’m guessing it’s a misclassification by FEMA. If anyone knows the story, I’d love to know.

Mapping the World’s Rivers

2014 August 25
by Ben Jones

“All the rivers run into the sea, yet the sea is not full; to the place from which the rivers come, there they return again.” Ecclesiastes 1:7

It was purely coincidental that during #MappingMonth a Tableau Public author reached out to me and asked me if it was possible to create a map with rivers as interactive polylines. He was in the process of gathering coordinates manually from Google Maps, and he felt there had to be a better way. I knew he was right – if we could find a data set with latitude and longitude coordinates for each river, then we could use the Path shelf to draw each river as a line on a world map.

What, exactly, is the use-case for a map of the world’s rivers? I admit I don’t quite know, but it was an interesting challenge, and certainly made for a fun and educational project for my two sons to help me with. You gotta get creative to make sure they learn something during the summer.

After Tableau mapping guru Allan Walker pointed me in the direction of NaturalEarthData, here’s what we were able to create (see below for a brief tutorial, and a #MappingMonth surprise):

How to Map the World’s Rivers

Step 1: Get the Shapefile

To start with, I had to find the Shapefiles for all of the world’s rivers. At least the big ones. As I mentioned, Allan Walker pointed me in the direction of NaturalEarthData’s 1:10m Physical Vectors, and uber map geek Nathaniel Kelso helped me find the files to download (he also runs a github account with links to download every NaturalEarthData download file). This resource is truly amazing – it has shapefiles for coastline, oceans, reefs, glaciated areas, and a few more – all freely available. I downloaded a zip file of rivers and lakes centerlines. Step 1 complete.

Step 2: Convert the Shapefile to CSV

This step used to be arduous and time consuming until Alteryx published a Shapefile to Polygon Converter to their Analytics Gallery. It’s a web app that requires a free login, and allows you to take that zip file you just downloaded and turn it into a CSV or a TDE (Tableau Data Extract). Most people are familiar with CSV, so let’s follow that option. Here is the CSV file that the Alteryx converter created for me. Here’s what the CSV file looks like – in particular, note the fields “Polygon ID”, “Subpolygon ID” and “Point ID”. They will play an important role in step 3:

Step 3: Connect Tableau to the CSV and create a Map

Now that you’ve got your CSV, it’s a fairly easy step to use it to create a map in Tableau. Start by connecting Tableau to this CSV, and then do the following in a new Sheet:

  1. Double click Latitude (goes to Rows) and double click Longitude (goes to Columns)
  2. Change Marks from Automatic to Lines
  3. Drag Polygon ID and Sub Polygon ID to the Detail Shelf
  4. Drag Point ID to the Path Shelf
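What the Detail and Path shelves are doing in those steps can be sketched outside Tableau: the two ID fields decide which points belong to the same line, and Point ID decides the order in which they are connected. The sketch below assumes rows keyed by the field names mentioned in step 2 plus “Latitude” and “Longitude”; the coordinates are invented and `build_paths` is a hypothetical helper.

```python
from collections import defaultdict


def build_paths(rows):
    """Group points into lines and order each line's points by Point ID.

    Returns {(polygon_id, subpolygon_id): [(lat, lon), ...]}.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Polygon ID"], row["Subpolygon ID"])].append(row)
    paths = {}
    for key, points in groups.items():
        points.sort(key=lambda r: r["Point ID"])
        paths[key] = [(p["Latitude"], p["Longitude"]) for p in points]
    return paths


# Invented points, deliberately out of order.
rows = [
    {"Polygon ID": 1, "Subpolygon ID": 1, "Point ID": 2, "Latitude": 10.5, "Longitude": 20.5},
    {"Polygon ID": 1, "Subpolygon ID": 1, "Point ID": 1, "Latitude": 10.0, "Longitude": 20.0},
    {"Polygon ID": 2, "Subpolygon ID": 1, "Point ID": 1, "Latitude": -5.0, "Longitude": 30.0},
]
print(build_paths(rows))
```

Leave either ID off the Detail shelf and points from different rivers get joined into one zig-zagging line, which is usually the first thing to check when a path map looks scribbled.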

You should now have the basic map of the rivers of the world, and your screen should look something like this:

To complete the viz, I colored the rivers by Scalerank, added two Quick Filters (Scalerank and river Name), formatted the Tooltips, and added the map along with a histogram of Scalerank to a dashboard. I asked my son Aaron to pick the title font, and he picked Brush Script MT because he said he thought the letters looked “rivery”. I couldn’t argue with that, so we made a PNG with transparency and added it as an image (because Brush Script MT isn’t a safe web font).

Now here’s to you, Mr. Robinson

I said I had a surprise, and here it is. I’ve been playing around with (read: obsessing about) different map projections lately. I figured out how to convert the latitude and longitude coordinates into x, y values of the Robinson projection, a projection that the National Geographic Society used from 1988 to 1998, before ditching it in favor of the Winkel tripel. I won’t get into too much detail here, but suffice it to say, the Robinson is a pseudocylindrical projection that’s really only suitable for creating thematic maps of the entire world. Compare it with other projections using this handy summary image. More to come on this soon, but for now, here is the rivers dashboard in the Robinson projection:

Notice that Greenland doesn’t loom as large as it does in the Mercator projection, which distorts its size quite a lot (it’s actually 1/8 the surface area of South America). Also notice, however, that the Robinson projection “curves” inward at both poles (latitude lines get shorter as you move away from the equator) – this means that if you were to zoom in to the street level in, say, Finland, streets that cross at a right angle in the real world wouldn’t appear to on the map. That’s what you get with Mercator in return for some area distortion. Every map has its pros and cons.

If you’re interested in building a Robinson projection yourself, here are the equations to make the conversion within Tableau. I recommend either drawing the coastlines and graticules yourself, or finding a good Robinson map image and adding it as a Background Image, fixing the position carefully. Here is the map image I used. It works fairly well when zoomed out to show the entire world, but I hid the Zoom controls since it really doesn’t work well when zoomed in.
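For a feel of what the conversion involves, here is a rough Python sketch of one common way to compute Robinson coordinates: a lookup table at 5-degree latitude steps with interpolation in between. To be clear about the assumptions, the table values are the published Robinson table as I know it (worth double-checking against a reference before relying on them), plain linear interpolation is a simplification, and this is not necessarily the same set of equations as the ones linked above.

```python
import math

# Table values at latitudes 0, 5, ..., 90 degrees.
# _X: relative length of each parallel; _Y: relative distance from the equator.
_X = [1.0000, 0.9986, 0.9954, 0.9900, 0.9822, 0.9730, 0.9600, 0.9427, 0.9216,
      0.8962, 0.8679, 0.8350, 0.7986, 0.7597, 0.7186, 0.6732, 0.6213, 0.5722,
      0.5322]
_Y = [0.0000, 0.0620, 0.1240, 0.1860, 0.2480, 0.3100, 0.3720, 0.4340, 0.4958,
      0.5571, 0.6176, 0.6769, 0.7346, 0.7903, 0.8435, 0.8936, 0.9394, 0.9761,
      1.0000]


def robinson_xy(lat_deg, lon_deg, central_meridian_deg=0.0, radius=1.0):
    """Approximate Robinson x, y for a latitude/longitude in degrees."""
    abs_lat = min(abs(lat_deg), 90.0)
    i = min(int(abs_lat // 5), 17)      # index of the table row at or below abs_lat
    frac = (abs_lat - 5 * i) / 5        # position between row i and row i + 1
    parallel_length = _X[i] + frac * (_X[i + 1] - _X[i])
    equator_distance = _Y[i] + frac * (_Y[i + 1] - _Y[i])
    x = 0.8487 * radius * parallel_length * math.radians(lon_deg - central_meridian_deg)
    y = math.copysign(1.3523 * radius * equator_distance, lat_deg)
    return x, y


print(robinson_xy(0, 180))   # right edge of the map at the equator
print(robinson_xy(90, 180))  # right edge at the pole, noticeably closer to the center
```

The shrinking values in the first table are the “curving inward” toward the poles: at 90 degrees a parallel is drawn at roughly half the length of the equator.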

Thanks for stopping by,