I’m reading “Thinking, Fast and Slow” by Nobel Prize winner Daniel Kahneman – a frighteningly interesting book about cognitive biases and “heuristics” (rules of thumb) in decision making. If you deal with numbers at all and haven’t read it yet, you should. In it he refers to an article by Howard Wainer and Harris L. Zwerling called “Evidence That Smaller Schools Do Not Improve Student Achievement” that talks about kidney cancer rates, of all things.
Kidney cancer is a relatively rare form of cancer, accounting for only ~3% of all adult cancers. If you look at kidney cancer rates by county in the U.S. an interesting pattern emerges, as he describes on page 109 of his book:
“The counties in which the incidence of kidney cancer is lowest are mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.” (p. 109)
What do you make of this? He goes on to list some of the reasons people come up with in an attempt to rationalize this fact: residents of rural counties have access to fresh food, lack of air pollution, etc. Did these explanations come to your mind, too? He then points out the following:
“Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.”
Again, people come up with various theories to explain this fact: rural counties have relatively high poverty rates, high fat diet, lack of access to medication, etc.
But wait – what’s going on here? Rural counties have both the highest and the lowest kidney cancer rates? What gives?
Insensitivity to Sample Size
This is a great example of a bias known as “insensitivity to sample size”. It goes like this: when we deal with data, we don’t take into account sample size when we think about probability. These rural counties have relatively few people, and as such, they are more likely to have either very high or very low incidence rates. Why? Because the variance of the mean is inversely proportional to the sample size. The smaller the sample, the greater the variance (proof).
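To see the effect without any real data, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the rate of 1%, the two population sizes, and the number of counties are assumptions, not the kidney cancer figures. Every simulated county shares the same true rate, so any difference in spread comes from sample size alone.

```python
import math
import random
import statistics

def standard_error(rate, population):
    """Theoretical standard deviation of an observed rate (binomial model)."""
    return math.sqrt(rate * (1 - rate) / population)

def simulate_rates(population, rate, n_counties, rng):
    """Observed rate in each of n_counties counties that share one true rate."""
    observed = []
    for _ in range(n_counties):
        cases = sum(rng.random() < rate for _ in range(population))
        observed.append(cases / population)
    return observed

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_RATE = 0.01          # made-up rate, identical for every county
small = simulate_rates(500, TRUE_RATE, 100, rng)
large = simulate_rates(20_000, TRUE_RATE, 100, rng)

print("spread of small-county rates:", statistics.stdev(small))
print("spread of large-county rates:", statistics.stdev(large))
print("theory, small vs. large:", standard_error(TRUE_RATE, 500),
      standard_error(TRUE_RATE, 20_000))
```

The small counties should land at both the top and the bottom of the ranking, even though nothing distinguishes them except how few people they contain.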
I found the 2007-2011 kidney cancer rate data and the 2010 population data for each U.S. county, and created this interactive graphic to illustrate the point that Kahneman, Wainer and Zwerling are trying to make:
Notice a few things in the dashboard above:
- In the choropleth map, the darkest orange (high rates relative to the overall U.S. rate) and the darkest blue (low rates relative to the overall U.S. rate) counties are often right next to each other
- In the scatterplot below the map, the marks form a funnel shape, with less populous counties (to the left) more likely to deviate from the reference line (the overall US rate), and more populous counties like Chicago, L.A. and New York more likely to be close to the overall reference line
- If you hover over a county with a small population, you will notice that the average number of cases per year is extremely low – 4 cases or fewer sometimes. A small deviation – even just 1 or 2 cases – in a subsequent year will shoot a county from the bottom of the list to the top, or vice versa
Where else does “insensitivity to sample size” come up? My colleague Dash Davidson suggested “streaks” in sports, which can often be just a “clustering illusion“. We look at a brief sample of a player’s overall performance and notice temporary periods of greatness. We should expect to see such streaks for even mediocre players. Remember Linsanity? Similarly, small samples make some rich and others poor in the world of gambling. You may have a good day at the tables, but if you keep playing, eventually the house will win. And in investing, “diversification” is nothing more than a strategy to minimize exposure to extreme downside risks of individual securities (think Enron).
Kahneman and his long-time partner Amos Tversky even showed that 84 professional psychologists were subject to this very same bias, so experts are not immune.
Avoiding this Pitfall
So what do we do about it? How do we make sure we don’t fall into the pitfall known as “insensitivity to sample size”?
- Be aware of any sampling involved in the data we are analyzing
- Understand that the smaller the sample size, the more likely we will see a rate or statistic that deviates significantly from the population
- Before forming theories about why a particular sample deviates from the population in some way, first consider that it may just be noise and chance
- Visualize the rate or statistic associated with groups of varying size in a scatterplot. If you see the telltale funnel shape, then you know not to be fooled
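The funnel in such a scatterplot can also be approximated numerically. The sketch below is one rough way to do it, assuming a normal approximation to the binomial (which is itself shaky when the expected number of cases is very small). The overall rate and the two counties are placeholder numbers, not values from the dashboard.

```python
import math

def funnel_limits(overall_rate, population, z=1.96):
    """Approximate band in which a group's rate should fall by chance alone."""
    se = math.sqrt(overall_rate * (1 - overall_rate) / population)
    return max(0.0, overall_rate - z * se), overall_rate + z * se

def outside_funnel(cases, population, overall_rate, z=1.96):
    """True if the observed rate falls outside the approximate band."""
    low, high = funnel_limits(overall_rate, population, z)
    rate = cases / population
    return rate < low or rate > high

OVERALL = 16 / 100_000  # placeholder overall rate

# Both hypothetical counties show a rate 2.5 times the overall rate.
print(outside_funnel(2, 5_000, OVERALL))        # tiny county
print(outside_funnel(400, 1_000_000, OVERALL))  # large county
```

The same elevated rate is unremarkable in the tiny county and well outside the band in the large one, which is the reason to check group size before theorizing.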
The point of the original article by Wainer and Zwerling is that smaller schools are apt to yield extreme test scores by virtue of the fact that there aren’t enough students in small schools to “even out” the scores. A random cluster of extremely good (or bad) performers can sway a small school’s scores. At a very big school, yes a few bad results will still affect the overall mean, but not nearly as much.
Here’s another way to think of it: if Daniel Kahneman ever moved to Lost Springs, Wyoming, then half of the town’s population would be Nobel Prize winners. And if you think that moving there would increase your chances of winning the Nobel Prize, or that it’s “in the water” or some other such reason, then you’re suffering from a severe case of insensitivity to sample size.
Do any other examples of this pitfall come to mind? Ever fall into it yourself? Share by leaving a comment below.
Thanks for stopping by,
Happy New Year! In 2015 I’ll be publishing a periodic series of blog posts entitled “Avoiding Data Pitfalls” where I’ll suggest ways to avoid common errors people make when working with data. The pitfalls range from philosophical to technical, and from analytical to visual. I’m familiar with these pitfalls because I’ve fallen into them myself, some of them repeatedly. If I’m the only one that these posts keep out of trouble, then it’ll be worth it.
We fall head first into a pitfall when we fail to remember that a gap exists between our data and reality. Do people really fail to realize this? I see (and make) this mistake quite often. I’m starting with this one because it’s foundational, dealing with the grounds and limits of our knowledge. How does it work?
It works like this: we get some data, and run with it, never stopping to think about where it came from, who collected it, what it tells us, and, importantly, what it doesn’t tell us.
It’s easy when working with data to treat it as reality rather than data collected about reality. Here are some examples:
- It’s not crime, it’s reported crime.
- It’s not the number of meteor strikes, it’s the number of recorded meteor strikes.
- It’s not the outer diameter of a mechanical part, it’s the measured outer diameter.
- It’s not how the public feels about a controversial topic, it’s how survey respondents are willing to say they feel.
- It’s not how many people suffer from a particular disease, it’s how many times a doctor diagnosed people with a particular disease.
You get the picture. This distinction may seem like a technicality, and sometimes it is (the number of home runs Hank Aaron “reportedly” hit?) but it can also be a big deal. Let’s see an example of how missing it can lead us astray:
Example #1: Actual vs. Recorded Earthquakes
Consider earthquakes. The USGS provides an Earthquake Archive Search form that lets visitors obtain a list of historical earthquakes that meet various criteria. A query of earthquakes of magnitude 6.0 and above from 1900 to 2013 yields the following, somewhat alarming, line plot:
Are we really to believe that earthquakes have increased in frequency by this much? Obviously not. The world that measured and collected earthquakes in the early 20th century was very different than the one that did so in the last decade. Comparisons across decades, and even within some decades (the 1960s), aren’t “apples-to-apples” due to the changes in technology.
If we separate the line plot by magnitude and add annotations that describe advances in seismology, we see that the rise is only in the smaller group (magnitude 6.0-6.9), and coincides with dramatic improvements in instrumentation:
It’s safe to say that the rise in recorded earthquakes is primarily due to the improvements in our ability to detect them. There may also be an upward trend in actual earthquakes over this time, but our data doesn’t reveal it to us due to the continual changes in the quality of the measurement system. When it comes to earthquakes, the gap between data and reality is getting smaller. The problem is that the “data-reality gap” is changing over the time period we’re considering. And it’s hard to know for sure exactly how many magnitude 6.0 earthquakes we missed in any particular year.
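Splitting the counts by magnitude band is simple to reproduce. Here is a small Python sketch of the idea; the records are invented, and the 7.0-7.9 and 8.0+ bands are my assumption (the post only names the 6.0-6.9 group).

```python
from collections import Counter

def magnitude_band(magnitude):
    """Assign a magnitude (6.0 and above) to a band; band edges are assumptions."""
    if magnitude >= 8.0:
        return "8.0+"
    if magnitude >= 7.0:
        return "7.0-7.9"
    return "6.0-6.9"

def counts_by_decade_and_band(quakes):
    """Count (year, magnitude) records per (decade, band) pair."""
    counts = Counter()
    for year, magnitude in quakes:
        decade = (year // 10) * 10
        counts[(decade, magnitude_band(magnitude))] += 1
    return counts

# Invented records, for illustration only.
quakes = [(1905, 7.5), (1908, 8.1), (1975, 6.2), (1978, 6.4),
          (1979, 7.1), (2004, 6.0), (2004, 9.1)]
counts = counts_by_decade_and_band(quakes)
for key in sorted(counts):
    print(key, counts[key])
```

If a rise shows up only in the smallest band, improved detection is a more likely explanation than a change in the earth itself.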
Let’s look at another example – counting bicycles that cross a bridge.
Example #2: Counting Bicycles
Every day on my way to work I walk across the Fremont Bridge. It’s a bright blue and orange double-leaf bascule bridge that was built in 1917. Since it sits so close to the water, it opens on average 35 times a day, which supposedly makes it the most opened drawbridge in the United States. Here’s what it looks like:
The City of Seattle Department of Transportation has installed two “inductive loops” on the pedestrian/bicycle pathways of the bridge that are designed to count the number of bicycles that cross the bridge in either direction, all day every day. The city also provides hourly counts going back to October 2, 2012 at data.seattle.gov. Downloading this data and visualizing it yields the following timeline:
I showed this data at a recent luncheon of the Puget Sound Research Forum, and asked what the attendees thought of these spikes. I honestly didn’t know what had caused them. A few ideas sprang from the crowd – was it “bike to work day”, really good weather, or maybe there was some big bike race or club event? Notice how each of these ideas is based on the assumption that there actually were more bikes that crossed the bridge on those days.
David Bauer was in the audience and found the answer for us: equipment error. The counters just glitched for a few hours on both days. You can read all the details of these anomalous readings and the correspondence between a local blogger and a city employee at the Seattle Bike Blog. I won’t repeat it here, but suffice it to say measuring things can be tricky. Turns out a low battery was the culprit.
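Spikes like these can be flagged before anyone starts theorizing about them. The sketch below is one possible approach, not what the city or I actually used: it scores each count by its distance from the median, measured in units of the median absolute deviation (MAD). The counts and the threshold of 5 are made up.

```python
import statistics

def flag_spikes(counts, threshold=5.0):
    """Return indices whose distance from the median exceeds threshold * MAD."""
    median = statistics.median(counts)
    mad = statistics.median([abs(c - median) for c in counts])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, c in enumerate(counts)
            if abs(c - median) / mad > threshold]

# Made-up hourly counts with one suspicious reading.
hourly = [120, 135, 110, 140, 125, 2000, 130, 118]
print(flag_spikes(hourly))
```

A flagged reading is only a prompt to ask how the measurement was made. It says nothing about whether the cause is a real surge in bikes or a dying battery.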
Let’s consider one last example: counting Ebola deaths.
Example #3: Ebola Deaths
This past year, the whole world watched in horror as Ebola ravaged West Africa. It’s still happening, by the way; we’ve just stopped noticing. In any case, the WHO provides data about fatalities in weekly situation reports. I had an interesting discussion on Twitter with Alex McDonnell about this data. In it he referred to errors in the WHO reports.
Errors? About one of the world’s most closely followed topics? From one of the world’s most respected organizations? You bet.
Let’s take a look at a timeline of cumulative deaths from Ebola as reported by WHO and CDC. Notice the drops in cumulative death counts – the handful of times when the lines slope down:
Of course it makes perfect sense: the task of diagnosing disease and ascertaining causes of death in some of the more remote locations, where the equipment and staff are often severely limited, must be incredibly difficult. If you read the WHO situation report you’ll notice that they classify cases as “confirmed”, “probable” and “suspected”. It’s not always so obvious. Here are the criteria:
The WHO and CDC actually do a very good job of speaking clearly about “reported” cases (the Dec 31st WHO situation report includes the word “reported” no less than 61 times).
I don’t bring up this example to criticize the people or organizations involved with fighting and documenting the Ebola outbreak. Far from it. This example merely demonstrates that the gap between data and reality can exist even when the stakes are high. Classifying diseases and deaths in chaotic conditions can be tricky business indeed.
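A cumulative count should never go down, so the downward steps mentioned above are easy to find programmatically. A minimal Python sketch, using invented numbers rather than the WHO or CDC figures:

```python
def find_decreases(cumulative):
    """Return (index, previous, current) wherever a cumulative series drops."""
    return [(i, cumulative[i - 1], cumulative[i])
            for i in range(1, len(cumulative))
            if cumulative[i] < cumulative[i - 1]]

# Invented cumulative totals; the fourth report revises the total downward.
reported = [100, 150, 210, 190, 260]
print(find_decreases(reported))
```

Each hit marks a report where earlier cases were reclassified or corrected, which is exactly the kind of place to go and read the footnotes.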
How to Avoid Confusing Data with Reality
Notice that in these three examples – 1. earthquakes (a dubious trend), 2. bicycle counting (a spike or outlier), and 3. Ebola deaths (a downward slope in a cumulative line plot) – something in the view of the data itself alerted us to a potential “data-reality gap”. Visualizing the data can be one of the best ways to find problems with it.
Earlier in the game, though, it helps to remind ourselves that every data point that exists was collected, stored, accessed, etc., via imperfect processes. The more we know about these processes – the equipment used, the protocol followed, the people involved, the steps they took, their motivations – the better equipped we will be to assess the “data-reality gap”.
Here are seven suggestions to help you avoid confusing data with reality:
- Clearly understand the operational definitions of all metrics
- Draw the data collection steps as a process flow diagram
- Understand the limitations and inaccuracies of each step in the process
- Identify any changes in method or equipment over time
- Seek to understand the motives of the people collecting and reporting. Could there be any biases or incentives involved?
- Visualize the data and investigate any shifts, outliers and trends for possible discrepancies
- Think carefully about data formatting, processing, and transformations (thanks Keith!)
Ultimately, each data collection activity is unique, and there are too many possible sources of error to list them all. These are some typical ones that I’ve come in contact with, and you may have your own suggestions. I’d love to hear them.
At the core of this first “data pitfall” is our attitude toward data. Do we arrogantly or naively see ourselves as experts on a topic as soon as we get our hands on some data, or do we humbly realize that our knowledge is imperfect, and we may not know the full story?
We can’t ever perfectly know the “data-reality gap” because that would require perfect data. What we can do, though, is seek to identify any gaps that may exist, and take that into account when we use data to form our opinions.
Thanks for reading,
Since Kobe Bryant surpassed Michael Jordan in “career points” scored this past Sunday, much has been written about who is better, etc. If a) you’ve heard enough already, or b) you don’t care about sports at all, then you may stop reading now. There is a finer point about “data definitions”, but I’ll leave that for later. First, back to basketball player egos:
There is one thing about this debate that puzzles me: why regular season stats alone are used in the “career” totals. I can understand leaving out the All-Star game stats, as the All-Star game is largely a meaningless pick-up game (and crazy points-fest). But playoffs? NBA basketball is all about the playoffs. That’s where the games really matter, the real drama unfolds, and Hall-of-Fame “careers” are made.
Bryant and Jordan have both had considerable success in the playoffs, but somehow the points they scored during these crucial games don’t count to their career totals? I don’t get it. If there’s a compelling reason to leave out playoff stats from “career totals”, I’m unaware of it.
And if you take into consideration playoff point totals when tallying career stats, Bryant hasn’t passed Jordan quite yet:
Of course you can make the argument that this just postpones the inevitable by a few weeks, as Bryant will soon surpass Jordan in total points scored including the playoffs. So who cares?
To me, it just shows how we can get wrapped up in debates about numbers without stopping to consider the “data definitions” – exactly what are we comparing? How is this data collected, what is included and what is not included? Does it even make sense?
Did you know all those dramatic game-winning shots these two players made in crunch time aren’t even included in their career totals?
Like this one:
And these ones:
I bet you didn’t. If you did, can you defend it? I can’t.
At some point in the next month or so, Kobe will pass Jordan in the amount of total points he has scored in both the regular season and the playoffs combined. There will be no hugging at center court, no interviews, no headlines or blog posts about the stats. That’s not such a bad thing, though, I guess.
As for who is the better scorer? Jordan scored at a higher rate, but took two extended breaks during his career. Bryant took no such breaks, and only had one considerable setback to injury, so he racked up points at a slower, albeit unabated pace. And for all the flak Bryant has taken for being a ball hog, try comparing his career Assists with Jordan’s. He passed Jordan in assists two years ago.
Thanks for indulging me,
In June of this year I published my first book with O’Reilly Media called “Communicating Data with Tableau”. It has been great to hear from readers around the world, and I’m grateful for the reviews that have been published.
Here is a sampler that includes the entirety of Chapter 1 (entitled “Communicating Data”) in pdf format, freely available for anyone to download and share (click to download the pdf):
In Chapter 1, I share my thoughts on the concept of creating and sharing data visualizations as a particular form of the communication process. Thinking of it that way has been very helpful for me, because it frames the activity as an attempt to affect the minds of others.
Just like other types of communication such as speech and even body language, its success depends on many factors, some obvious and some subtle. It is subject to problems of various kinds, as identified by the fathers of information theory, Shannon and Weaver. And critically, communicating data touches on both the rational and the emotional.
Ultimately, I outline 6 Principles of Communicating Data, for which I have also created a handy online checklist. These principles have helped me focus my efforts and avoid some common pitfalls that I have fallen into in the past, such as failing to identify and know my audience, or only considering a subset of the relevant data.
Thanks for visiting, and please be sure to send me a note to tell me what you think about these resources, and what you would add/edit/remove.
Vox published an interesting post today called “America has stopped paying attention, but Ebola is still ravaging Sierra Leone”. It made me wonder whether it’s just America that has stopped paying attention, or if in fact other parts of the world have moved on as well. I turned to Google Trends to look at relative search popularity of the string ‘ebola’ over time for various countries. Here’s what I found:
A quick glance suggests that other unaffected countries have moved on as well. At least these nine have. More specifically, people in these other countries are also using Google to search for the English word ‘ebola’ far less frequently than they were earlier in the year, especially during early to mid October, when the hype hit its peak (except in India, where search was highest in early August).
One interesting country to look at is Japan. The dashboard above seems to indicate that no one in Japan cared at all about ebola, all year long. Is that true? No, it’s not true, and it highlights one of the limitations of using this type of data to answer this question. If you do a similar Google Trends search for エボラ, the Japanese word for ‘ebola’, here’s what you get:
It helps to understand exactly what your data is telling you, and what it isn’t telling you.
Ebola is the same word in English, Spanish, Portuguese, German, Italian, and Hindi, so the other countries probably don’t have a similar problem. I wasn’t able to find the Pashto or Dari (Afghanistan) translations of the word “ebola”, but suffice it to say that Google search trends are far less effective a proxy for popular interest in Afghanistan, where only 5.9% of the population uses the internet, according to The World Bank.
Finally, if we compare relative Google search popularity for ‘ebola’ in heavily affected countries like Liberia, Sierra Leone, and Guinea, here’s what we find:
- Get the raw data Excel file here
- Since Google Trends only allowed me to compare 5 countries at a time, I had to run two separate queries, with United States included in both queries to maintain a common reference point of comparison.
- After running the queries, I downloaded the data as a CSV by clicking on the gear icon in the top right corner of the Google Trends page.
- I combined both CSV downloads into one spreadsheet and used the Tableau Reshaper Excel Add-in (Windows only) to convert the resulting cross-tab table into a long list of data values – a single row for each week for each country.
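The reshaping step in the last bullet (cross-tab to long list) can also be done without the Excel add-in. Here is a minimal Python sketch using only the standard library; the column names “Week”, “Country” and “Interest” and the sample values are my assumptions, not the layout of the actual Google Trends download.

```python
import csv
import io

def crosstab_to_long(csv_text, id_column="Week"):
    """Turn one column per country into one row per week per country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        for column, value in record.items():
            if column == id_column:
                continue
            rows.append({id_column: record[id_column],
                         "Country": column,
                         "Interest": int(value)})
    return rows

# Made-up cross-tab: one row per week, one column per country.
sample = "Week,United States,Japan\n2014-10-05,80,3\n2014-10-12,100,4\n"
long_rows = crosstab_to_long(sample)
for row in long_rows:
    print(row)
```

Two weeks and two countries become four rows, which is the shape Tableau generally prefers.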
As always, let me know if you have any thoughts about this topic, my approach to understanding it, or the visualization I created to communicate my findings.
Thanks for stopping by,
At the Tableau Public blog, we’ve chosen to focus on political data visualizations during the month of October, since election day in the United States is right around the corner. We’re using the hashtag #VizTheVote to collect our posts and to encourage others to share their thoughts on an aspect of our world that is rich with data (or at least should be) and is also ripe for visualization.
In this blog post, I’m going to show you how to take advantage of a seldom-used mapping feature in Tableau Public 8.2: built-in U.S. Congressional District shapes. First, let’s look at a viz showing the 113th House of Representatives by either age or tenure, then I’ll go into detail about how it was made:
Step 1: Get the Data
If you look at the Wikipedia page showing the “List of current members of the United States House of Representatives by age“, it looks like this:
I copied and pasted this table into Excel and added a column indicating which party each politician belongs to. Step 1 done.
Step 2: Structure the Data
This table is great, but notice the first column – “District”. It combines the state and the congressional district number into one geographic field. In order for Tableau to recognize the congressional district and apply the correct shape, these two fields need to be separated into a “District” column and a “State” column. I did this in Excel using “Text to Columns”. Here’s an image of the final Excel spreadsheet I used to build the viz:
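If you’d rather script the split than use “Text to Columns”, something like the following Python sketch would do. It assumes each label is a state name followed by a single final token (such as “New York 13” or “Alaska at-large”); that format is my assumption about the table, so check it against the actual data.

```python
def split_district(label):
    """Split 'State Name 12' into ('State Name', '12') at the last space."""
    state, _, district = label.strip().rpartition(" ")
    return state, district

for label in ["Alabama 1", "New York 13", "Alaska at-large"]:
    print(split_district(label))
```

Splitting at the last space rather than the first keeps multi-word state names intact. Whether a value like “at-large” is recognized by Tableau’s geographic role is not something this sketch checks.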
Just the number itself suffices to automatically draw congressional district shapes in Tableau, but there are a few other variations that will also work, as shown in the Geographic Role table below:
Step 3: Visualize the Data
Connect a new Tableau workbook to this spreadsheet and make sure the geographic role for District is set to “Congressional District” (right click on the District pill). Then, do the following:
- Double click on “Latitude (generated)” (goes to Rows) and “Longitude (generated)” (goes to Columns)
- Drag both “District” and “State” to the Detail shelf
- Change the Marks type from Automatic to Filled Map
- Drag “Age” to the Color shelf (Age is a calculated field calculating the DATEDIFF between today and the date of birth)
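The age calculation in the last step has one subtlety: subtracting birth year from the current year overstates age by one for anyone whose birthday hasn’t come around yet this year. Here is a Python sketch of the completed-years version, using made-up dates. It is not the Tableau formula from the workbook, and it’s worth checking which behaviour a given DATEDIFF call has.

```python
from datetime import date

def age_in_years(born, today):
    """Completed years between a date of birth and a reference date."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

print(age_in_years(date(1950, 6, 15), date(2014, 10, 1)))  # after the birthday
print(age_in_years(date(1950, 6, 15), date(2014, 6, 14)))  # day before it
```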
Here’s an image of the map that shows the age of each member of the House of Representatives, with darker colors indicating older reps:
Step 4: Create the Dashboard
This last part is Tableau 101 and maybe a little bit of 201. I won’t go into detail about how to create the additional Sheets and combine them on a single Dashboard, as I go into detail on how to do this in Chapters 13 and 14 of my book Communicating Data with Tableau.
Interesting to notice a few things: that congressional districts are split on a coastal vs land-locked basis, that some members of the House are quite old and have hung on to their seat upwards of 5 or 6 decades (John Dingell, Michigan 12). Mostly, though, I hope you notice that creating choropleths of congressional districts in Tableau is quite easy.
For more data at a Congressional District level, check out the U.S. Census Bureau “American Fact Finder” table, or use this CSV I downloaded from the Census site that is ready to import directly into Tableau.
Thanks for stopping by,
I’m a big fan of small multiples in data viz, and I’m somewhat of a “Maphead” as well. Naturally, combining the two together results in a visualization that I’d vouch for almost any time. Kyle Kim of the LA Times just published a stunning series of 192 maps showing drought levels in California by week, going back to January 4, 2011. Small Multiple Maps can take up a lot of space, but they’re very effective at showing change over both time and geography. Judge for yourself.
I look at a lot of Tableau Public visualizations, but I don’t see a lot of “small multiple maps” out there. It’s not that they don’t exist, they’re just rare. They’re actually pretty easy to make, so I thought I’d show you one and walk you through how to create one for yourself. Here’s a small multiples map showing FEMA declared disasters, by county, since 1953:
If you want to follow this brief tutorial, first download this Excel file of FEMA disasters.
How to Make a Small Multiples Map in Tableau
There are at least two different ways you can create small multiples maps in Tableau. One way is to create a bunch of individual maps as Sheets and drag and drop them all onto a single Dashboard. The other way is to create a single Sheet with a grid of small maps. This blog post covers the second method, which has the advantage that the “OpenStreetMap” attribution only occurs once in the bottom left corner, instead of once for each multiple.
Step 1: Create a basic map
I started by creating a basic choropleth map of continental US counties. I double clicked on the county data field (Declared County/Area) and then dragged “Number of Records” to the Color Shelf. I filtered out the states and territories not in the “lower 48”, changed the Color to red, set county shape borders to “None”, and edited the Map Options to only show the coastline and borders:
Step 2: Create a “Row Number” and “Column Number” Calculated Field
There are 22 different “Incident Types” (so, plenty of material for Hollywood), but for this project I wanted to create a 3X3 grid, so I needed to identify the top 9 Incident Types. From a simple bar chart showing counts of Incident Type over the full date range, I found that (in descending order of frequency) Severe Storm(s), Flood, Hurricane, Snow, Fire, Severe Ice Storm, Tornado, Drought and Coastal Storm were the ones to include.
I wanted to put each of the 9 top Incident Types in its own box on the 3X3 grid starting with the least frequent type of the 9 (Coastal Storm) in the top left and working my way down to the most frequent (Severe Storm) in the bottom right. Each of the nine then would have a Row Number (1-3) and a Column Number (1-3). I created two new Calculated Fields (right click in the Dimensions or Measures area and select “Create Calculated Field”) to place each in its proper location:
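The same placement logic, sketched in Python rather than Tableau’s calculation language. It assumes the grid is filled left to right and then top to bottom, which is my reading of “working my way down”. The list order comes from the bar chart described above.

```python
# Top 9 Incident Types in descending order of frequency (from the post).
TOP_NINE = ["Severe Storm(s)", "Flood", "Hurricane", "Snow", "Fire",
            "Severe Ice Storm", "Tornado", "Drought", "Coastal Storm"]

def grid_position(incident_type, columns=3):
    """Return (row, column), 1-based, with the least frequent type at (1, 1)."""
    order = list(reversed(TOP_NINE))     # least frequent first
    index = order.index(incident_type)   # 0..8
    return index // columns + 1, index % columns + 1

for name in reversed(TOP_NINE):
    print(name, grid_position(name))
```

Integer division gives the row and the remainder gives the column, so the same function works for a 4X4 or 2X5 grid by changing `columns` and the list.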
Step 3: Use “Row Number” and “Column Number” to create the grid
Now that the grid location fields are created, I just needed to drag “Row Number” to the Row Shelf and “Column Number” to the Column Shelf, and change both from SUM to a Dimension. When I used a Quick Filter to only include the 9 top fields, I had my small multiples view:
Step 4: Formatting
The rest is mostly clean-up, really. Hiding the Row and Column Headers, customizing the Tooltips, adding a date Quick Filter, and placing the small multiples map on a Dashboard. In the Dashboard, the titles for the 9 boxes are actually 9 very similar Sheets with Incident Type and Number of Records added as Text and filtered to just one of the nine incident types.
What do you think? Easy to make, right? Pretty effective as well, wouldn’t you say?
I’d love to hear your thoughts, and thanks for stopping by,
PS. Coastal Storms seems to be occurring in rather… non-coastal areas in the country. Not entirely sure why, but I’m guessing it’s a misclassification by FEMA. If anyone knows the story, I’d love to know.
“All the rivers run into the sea, yet the sea is not full; to the place from which the rivers come, there they return again.” Ecclesiastes 1:7
It was purely coincidental that during #MappingMonth a Tableau Public author reached out to me and asked me if it was possible to create a map with rivers as interactive polylines. He was in the process of gathering coordinates manually from Google Maps, and he felt there had to be a better way. I knew he was right – if we could find a data set with latitude and longitude coordinates for each river, then we could use the Path shelf to draw each river as a line on a world map.
What, exactly, is the use-case for a map of the world’s rivers? I admit I don’t quite know, but it was an interesting challenge, and certainly made for a fun and educational project for my two sons to help me with. You gotta get creative to make sure they learn something during the summer.
How to Map the World’s Rivers
Step 1: Get the Shapefile
To start with, I had to find the Shapefiles for all of the world’s rivers. At least the big ones. As I mentioned, Allan Walker pointed me in the direction of NaturalEarthData.com’s 1:10m Physical Vectors, and uber map geek Nathaniel Kelso helped me find the files to download (he also runs a github account with links to download every NaturalEarthData download file). This resource is truly amazing – it has shapefiles for coastline, oceans, reefs, glaciated areas, and a few more – all freely available. I downloaded a zip file of rivers and lakes centerlines. Step 1 complete.
Step 2: Convert the Shapefile to CSV
This step used to be arduous and time consuming until Alteryx published a Shapefile to Polygon Converter to their Analytics Gallery. It’s a web app that requires a free login, and allows you to take that zip file you just downloaded and turn it into a CSV or a TDE (Tableau Data Extract). Most people are familiar with CSV, so let’s follow that option. Here is the CSV file that the Alteryx converter created for me. Here’s what the CSV file looks like – in particular, note the fields “Polygon ID”, “Subpolygon ID” and “Point ID”. They will play an important role in step 3:
Step 3: Connect Tableau to the CSV and create a Map
Now that you’ve got your CSV, it’s a fairly easy step to use it to create a map in Tableau. Start by connecting Tableau to this CSV, and then do the following in a new Sheet:
- Double click Latitude (goes to Rows) and double click Longitude (goes to Columns)
- Change Marks from Automatic to Lines
- Drag Polygon ID and Sub Polygon ID to the Detail Shelf
- Drag Point ID to the Path Shelf
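What the Detail and Path shelves are doing in the steps above can be sketched in a few lines of Python: group the points by polygon and subpolygon, then connect them in Point ID order. The field names follow the CSV described earlier; the coordinates are made up.

```python
from collections import defaultdict

def build_paths(rows):
    """Group points into lines keyed by (Polygon ID, Subpolygon ID)."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["Polygon ID"], row["Subpolygon ID"])
        grouped[key].append((row["Point ID"], row["Latitude"], row["Longitude"]))
    # Point ID gives the drawing order within each line.
    return {key: [(lat, lon) for _, lat, lon in sorted(points, key=lambda p: p[0])]
            for key, points in grouped.items()}

# Made-up points, deliberately out of order.
rows = [
    {"Polygon ID": 1, "Subpolygon ID": 1, "Point ID": 2, "Latitude": 10.5, "Longitude": 20.5},
    {"Polygon ID": 1, "Subpolygon ID": 1, "Point ID": 1, "Latitude": 10.0, "Longitude": 20.0},
    {"Polygon ID": 2, "Subpolygon ID": 1, "Point ID": 1, "Latitude": -5.0, "Longitude": 30.0},
]
paths = build_paths(rows)
print(paths)
```

Without the ordering field, the points of a river would be connected in whatever order they happen to appear in the file, which tends to produce a scribble rather than a line.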
To complete the viz, I colored the rivers by Scalerank, added two Quick Filters (Scalerank and river Name), formatted the Tooltips, and added the map along with a histogram of Scalerank to a dashboard. I asked my son Aaron to pick the title font, and he picked Brush Script MT because he said he thought the letters looked “rivery”. I couldn’t argue with that, so we made a PNG with transparency and added it as an image (because Brush Script MT isn’t a safe web font).
Now here’s to you, Mr. Robinson
I said I had a surprise, and here it is. I’ve been playing around with (read: obsessing about) different map projections lately. I figured out how to convert the latitude and longitude coordinates into x, y values of the Robinson projection, a projection that the National Geographic Society used from 1988 to 1998, before ditching it in favor of the Winkel tripel. I won’t get into too much detail here, but suffice it to say, the Robinson is a pseudocylindrical projection that’s really only suitable for creating thematic maps of the entire world. Compare it with other projections using this handy summary image. More to come on this soon, but for now, here is the rivers dashboard in the Robinson projection:
Notice that Greenland doesn’t loom as large as it does in the Mercator projection, which distorts its size quite a lot (it’s actually 1/8 the surface area of South America). Also notice, however, that the Robinson projection “curves” inward at both poles (latitude lines get shorter as you move away from the equator) – this means that if you were to zoom in to the street level in, say, Finland, streets that cross at a right angle in the real world wouldn’t appear to on the map. Preserving those angles is what you get with Mercator in return for its area distortion. Every map has its pros and cons.
If you’re interested in building a Robinson projection yourself, here are the equations to make the conversion within Tableau. I recommend either drawing the coastlines and graticules yourself, or finding a good Robinson map image and adding it as a Background Image, fixing the position carefully. Here is the map image I used. It works fairly well when zoomed out to show the entire world, but I hid the Zoom controls since it really doesn’t work well when zoomed in.
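For anyone who wants to try the conversion outside of Tableau first, here is a rough Python sketch. It is not necessarily the same calculation I used in Tableau: it takes Robinson's published lookup table (values at every 5 degrees of latitude), interpolates linearly between entries, and applies the usual scaling constants 0.8487 and 1.3523. The function name, the `radius` parameter, and the choice of linear interpolation are assumptions for illustration; smoother interpolation schemes will give slightly different values between table entries.

```python
import math

# Robinson's lookup table at 5-degree steps of latitude:
# (latitude, parallel length factor X, distance-from-equator factor Y)
ROBINSON_TABLE = [
    (0, 1.0000, 0.0000), (5, 0.9986, 0.0620), (10, 0.9954, 0.1240),
    (15, 0.9900, 0.1860), (20, 0.9822, 0.2480), (25, 0.9730, 0.3100),
    (30, 0.9600, 0.3720), (35, 0.9427, 0.4340), (40, 0.9216, 0.4958),
    (45, 0.8962, 0.5571), (50, 0.8679, 0.6176), (55, 0.8350, 0.6769),
    (60, 0.7986, 0.7346), (65, 0.7597, 0.7903), (70, 0.7186, 0.8435),
    (75, 0.6732, 0.8936), (80, 0.6213, 0.9394), (85, 0.5722, 0.9761),
    (90, 0.5322, 1.0000),
]


def robinson_xy(lat_deg, lon_deg, radius=1.0, central_meridian_deg=0.0):
    """Convert latitude/longitude in degrees to Robinson x, y (linear interpolation)."""
    abs_lat = min(abs(lat_deg), 90.0)
    # Index of the table row at or below this latitude (capped so i + 1 exists)
    i = min(int(abs_lat // 5), len(ROBINSON_TABLE) - 2)
    lat0, x0, y0 = ROBINSON_TABLE[i]
    _, x1, y1 = ROBINSON_TABLE[i + 1]
    t = (abs_lat - lat0) / 5.0
    x_factor = x0 + t * (x1 - x0)
    y_factor = y0 + t * (y1 - y0)
    lon_rad = math.radians(lon_deg - central_meridian_deg)
    x = 0.8487 * radius * x_factor * lon_rad
    # The table covers the northern hemisphere; mirror it for southern latitudes
    y = math.copysign(1.3523 * radius * y_factor, lat_deg)
    return x, y


for lat, lon in [(0, 180), (45, -90), (-60, 30), (90, 0)]:
    print((lat, lon), robinson_xy(lat, lon))
```

The resulting x and y values can be plotted in place of Longitude and Latitude, which is also why a background image or hand-drawn coastlines are needed: the built-in map tiles won't line up with the projected coordinates.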
Thanks for stopping by,
Since moving to the Seattle area in early 2013, we’ve been doing our best as a family to tromp our way through the lush, scenic trails around us, guided by a helpful little orange book entitled “Best Hikes with Kids: Western Washington & the Cascades”. At first, my gentle suggestion (read: stubborn insistence) to hit the trails was met with some light resistance (read: outright mutiny) from my two iPad-wielding boys, but not so much any more, I’m happy to say.
Being a data guy, I wanted to track our every step through the Pacific Northwest, so I downloaded an app for my iPhone called Backpacker GPS Trails Pro. It’s great – it tracks our coordinates and lets us capture photos or video along the way, among other things. The default dashboard on TrimbleOutdoors.com is nice and all, but, well – I told you I was a data guy – I wanted to make my own.
First, I’ll show you what I made to track our treks, and then I’ll show you how I managed to go from GPS to viz in 7 steps.
How to go from GPS to Viz – 7 Steps
STEP 1: RECORD the trip & SYNC the app
This is the best part. Get out there, enjoy the trail, and make sure to hit the start button on the app. Here are some screen shots from a recent trip we took:
STEP 2: DOWNLOAD the data in .gpx format
First, I had to log in to my account on TrimbleOutdoors.com. You may have a different GPS app, which is fine, but just make sure it’s one that allows you to download your data. If your app’s site lets you get the data in spreadsheet form, all the better. Mine didn’t, so I had to first get the .gpx file. Here’s a screenshot of the download page:
STEP 3: CONVERT the .gpx file to .txt
Next, I had to get the data into a text file, which was quite easy to do once I found a useful site called GPS Visualizer. It’s free to use (they accept donations), and you just indicate that you want a Plain text output, choose the .gpx file from your Download folder, check the boxes to add estimated fields, and Convert. Here’s how that looks:
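If you'd rather script this step than use a website, a GPX file is just XML, so a few lines of Python can pull out the track points. This is only a sketch under a few assumptions: the file uses the GPX 1.1 namespace, the points live in `trkpt` elements, and the sample data below is made up. It does not compute the estimated fields that GPS Visualizer can add.

```python
import xml.etree.ElementTree as ET

# GPX 1.1 namespace; older GPX 1.0 files use a different namespace URI
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Made-up sample track with two points, for illustration only
SAMPLE_GPX = """
<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><name>Sample hike</name><trkseg>
    <trkpt lat="47.5000" lon="-121.8000"><ele>150.0</ele><time>2014-07-04T17:00:00Z</time></trkpt>
    <trkpt lat="47.5010" lon="-121.8015"><ele>162.5</ele><time>2014-07-04T17:01:00Z</time></trkpt>
  </trkseg></trk>
</gpx>
"""


def gpx_to_rows(gpx_text):
    """Return one (lat, lon, elevation, time) tuple per track point."""
    root = ET.fromstring(gpx_text)
    rows = []
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        ele = pt.findtext("gpx:ele", default="", namespaces=GPX_NS)
        time = pt.findtext("gpx:time", default="", namespaces=GPX_NS)
        rows.append((float(pt.get("lat")), float(pt.get("lon")), ele, time))
    return rows


rows = gpx_to_rows(SAMPLE_GPX)
# Print as tab-delimited text, similar in spirit to a plain text export
print("latitude\tlongitude\televation\ttime")
for lat, lon, ele, time in rows:
    print(f"{lat}\t{lon}\t{ele}\t{time}")
```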
STEP 4: CLEAN UP the data spreadsheet
This step involves opening the .txt file from step 3, getting rid of any header rows, moving the multimedia files to the bottom of the list, combining multiple .txt files into one spreadsheet and giving each its own unique hike name. Here is a screenshot showing the original .txt file and the fully formatted spreadsheet:
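I did this cleanup by hand, but the same steps could be scripted. Here is a minimal Python sketch, assuming each .txt file is tab-delimited with a `type` column where track points are marked `T` and everything else (such as photo waypoints) is something different. The column names and sample rows are hypothetical, so adjust them to match your own export.

```python
import csv
import io

# Hypothetical tab-delimited exports; the real column layout may differ
HIKE_A = (
    "type\tlatitude\tlongitude\tname\n"
    "W\t47.50\t-121.80\tphoto1.jpg\n"
    "T\t47.50\t-121.80\t\n"
    "T\t47.51\t-121.81\t\n"
)
HIKE_B = (
    "type\tlatitude\tlongitude\tname\n"
    "T\t46.90\t-121.70\t\n"
    "W\t46.91\t-121.71\tphoto2.jpg\n"
)


def combine_hikes(hike_texts):
    """Merge several exports into one list of rows, tagging each row with its hike name.

    Track points come first; multimedia rows are moved to the bottom.
    """
    track_points, media_points = [], []
    for hike_name, text in hike_texts.items():
        # DictReader consumes the first header row of each file
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            if row["type"] == "type":
                continue  # skip a header row repeated inside the file
            row["hike"] = hike_name
            (track_points if row["type"] == "T" else media_points).append(row)
    return track_points + media_points


combined = combine_hikes({"Hike A": HIKE_A, "Hike B": HIKE_B})
for row in combined:
    print(row["hike"], row["type"], row["latitude"], row["longitude"], row["name"])
```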
STEP 5: CONNECT Tableau to the cleaned up spreadsheet
Open Tableau Desktop (or Tableau Public), and click “Connect to Data”, select Microsoft Excel, navigate to your hike spreadsheet, drag the sheet into the middle area, and then click “Go to Worksheet”.
STEP 6: CREATE your viz
Use Tableau’s UI to drag and drop your data fields onto the canvas and create Sheets, Dashboards and Stories. I used a few advanced features in this workbook, including:
- A Custom WMS (only works with Tableau Desktop) from USGS – US Topographic Basemap. Click Map > Background Maps > WMS Server, and enter: http://basemap.nationalmap.gov/ArcGIS/services/USGSTopo/MapServer/WMSServer?
- A web page object on the Dashboard that dynamically links to each hosted photo on TrimbleOutdoors.com based on the URL column in my data. Note that I also changed the size of the photos since the large images took a long time for the Trimble Outdoors servers to load. Smaller photos were obtainable using a URL parameter (“?size=Size265x180”). Drag a web page object onto the dashboard, click OK, and then select Dashboard > Actions > Add Action > URL and fill out the dialog box as follows:
STEP 7: PUBLISH to Tableau Public & EMBED in your website
Can’t get much easier. Click Server > Tableau Public > Save to Web as… (or in Tableau Public, File > Save to Web as) and copy and paste the embed code into your CMS.
The Last Leg
This was a fun personal project that I made for my boys, so I took some extra steps to add design elements to the dashboard. I was shooting for a hand-made / trail map / scrap book feel, hence the hand-written font instructions, compass image, photo corner tacks, tally mark image, etc. The mountain shape cut-away at the top of the viz is actually from a photo of the Olympics here in Washington, so I tried to stay true to the territory with each design element.
Let me know if you make good use of this tutorial, and if you have any other questions about how I made it.
Thanks for stopping by, and happy trails!
Every data set contains a myriad of stories. I’m using the word “story” in a liberal way here, not necessarily in the “bedtime story” kind of way, or even the “headline news story” kind of way. By “story”, I simply mean a sequence of data-driven statements that progressively explain the world we live in.
With even simple data sets, these types of data stories abound, some more interesting than others. Whenever I run workshops along with my Tableau Public teammates, we’re amazed at how each group, given the exact same data set, comes up with unique insights.
Earlier this month the UN celebrated World Population Day – a day to “raise awareness of global population issues” according to its Wikipedia page. I decided to play a game and see how many different data stories I could tell with a simple spreadsheet of population, birth rates and death rates for every country since 1960 as obtained from the World Bank’s online data repository.
I came up with six simple “types”: 1) change over time, 2) drill-down, 3) contrast, 4) intersections, 5) different factors, 6) outliers and trends. Use the tabs across the top to see the different stories, and use the tiles within each story to read each story point:
I ended this experiment with a feeling that I was just scratching the surface, and that there are many more data stories to be found and told from even this simple data set on world population.
I encourage you to consider these six story point types as thought-starters for whatever data set you are working on. Ultimately your data will have its own story, and it will likely be a combination of these building block story types and others that are out there. Also, help me out by downloading the workbook and seeing how many more you can tell. Leave a comment below with your version, or tweet me a link to it.
Thanks for stopping by,