(For this post, I owe a word of thanks to Andrew Beers – VP of Product Development at Tableau, for the raw data, and Mike Klaczynski – Data Analyst on the Tableau Public team, for showing me this method).
At the Seattle Hacks/Hackers event last night, we built an interactive data dashboard that allows the reader to explore bridges in the state of Washington, where a bridge crossing the Skagit River recently collapsed into the water after being struck by a truck carrying an oversize load.
What’s notable about this dashboard is that you can click on any of the 2,489 circles on the map and bring up an embedded Google satellite image of the bridge within the dashboard itself. I didn’t have to take a screen shot of each satellite image – that would be way too much fun. Instead, I used a little-known feature in Tableau Public – embedded web pages (similar to the Embedding YouTube post from a few weeks ago).
How to embed a Google satellite image in a Tableau Public visualization:
The first group of 5 steps shows you how to create a URL for each bridge, and the second group of 5 steps shows you how to add a box to your dashboard to pull up the bridge images.
I. Create the URLs
1. First, notice that the data file contains Latitude (“LAT”) & Longitude (“LON”) for each bridge.
2. A Google Maps search for a particular latitude and longitude (say, 48.445781 and -122.341108) yields a long link URL.
3. The URL can be simplified a little bit, to something like this:
https://maps.google.com/maps?q=48.445781,-122.341108&z=17&t=h&output=embed
4. Breaking down the elements of the URL, we can see the coordinates followed by three more parameters:
- "q=48.445781,-122.341108" – these are your coordinates. Note that if you have an address field instead of Lat/Long, you can put an address after "q=" as well
- &z=17 – this specifies the zoom level. Higher numbers zoom in, lower numbers zoom out
- &t=h – this specifies the type of map. (t=m is a map, t=h is a satellite view)
- &output=embed – this is a key parameter that makes sure the website you embed in your viz doesn’t include the entire site – just the map itself
5. You could then generalize the URL to:
https://maps.google.com/maps?q=<LAT>,<LON>&z=17&t=h&output=embed
You can see that the actual numbers for latitude and longitude have been replaced with the field names <LAT> and <LON>.
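If you'd rather generate these URLs outside of Tableau (say, to add a URL column directly to the data file), here's a minimal R sketch of the same templating. The LAT and LON column names come from the bridge data file described above; the example row itself is hypothetical:

```r
# A minimal sketch (not the Tableau URL action itself): build a Google
# Maps embed URL for each bridge from its latitude and longitude.
bridges <- data.frame(
  NAME = "Example Bridge",   # hypothetical row for illustration
  LAT  = 48.445781,
  LON  = -122.341108
)

# q = coordinates, z = zoom level, t = h for satellite, output = embed
bridges$URL <- sprintf(
  "https://maps.google.com/maps?q=%f,%f&z=17&t=h&output=embed",
  bridges$LAT, bridges$LON
)

bridges$URL
# [1] "https://maps.google.com/maps?q=48.445781,-122.341108&z=17&t=h&output=embed"
```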
The next group of steps walks you through how to add a box to your dashboard that pulls up this embedded satellite image when a user clicks on a particular circle.
II. Add dynamic Satellite Images to your Dashboard
1. First, in the dashboard tab, drag a Web Page onto your dashboard from the left-center panel (just leave the “Edit URL” dialog box blank and click “OK” for now):
2. From the Dashboard menu, choose Actions…, then click the "Add Action >" button and choose "URL…"
3. In the Add URL Action dialog box, select whatever sheet you have created that includes the fields LAT and LON, and choose what event you’d like to trigger navigation to the new image (Hover, Select, or Menu). In this case, I’ve selected Map as my Source Sheet and Select as my trigger event, but you could trigger the action from a table or other type of sheet. Here’s what the dialog box looks like:
4. Now comes the magic. Copy and paste the generalized URL above into the URL field of the dialog box, and replace <LAT> and <LON> with the corresponding field names in your data source by clicking the small arrow to the right of the URL text entry field:
5. That’s it! Test it out by clicking on the map circles and see the satellite image change accordingly.
I can see this being useful for organizations that would like to include images of office locations or real estate assets in their dashboards. For data journalists, it’s about allowing readers to interact with the abstract and the real in the same graphic.
If you make a dashboard with a dynamic Google map, be sure to post the link in the comments field for all to see.
Added 6/10: Here are the slides from the event:
Being a Canadian (eh) living south of the border, I’ve watched the US political process as an outsider looking in for all of my adult life. It’s a fascinating system, with plenty of fine points and flaws, which just means it’s a human system.
I had the chance to attend the TechActivist conference this weekend and present on data visualization using Tableau. I learned a lot about how people deeply involved in the political machinery of this country think, relate to each other, and approach their goals. There is no doubt that they all see data as a huge opportunity going forward.
To prepare for the conference, I was given Washington State election results from 2012. My Tableau colleagues Mike Klaczynski and Jewel Loree and I spent some time playing with the data and mashing it up with census data to see if we could find anything interesting in the results. We presented a number of findings and created this county-level voting results dashboard.
Click to see an interactive version, and use the drop-down in the upper right to switch between Republican and Democratic perspectives:
Here are the slides I presented based on my cursory research into the subject of data visualization and US politics. I don’t claim to be an expert in politics, but I did find some interesting articles and visualizations that I felt compelled to share:
As always, feel free to leave comments, feedback, suggestions, etc. If you really want to get my attention, go get Tableau Public (it’s free), download the workbook (click “Download” in the bottom right corner of the dashboard) and remix the data to show it the way you’d like to see it.
Lastly, here’s a link to many other election day visualizations created by the aforementioned Mike Klaczynski.
Thanks for stopping by,
F. Scott Fitzgerald, author of The Great Gatsby and many other works of classic American literature, kept a fairly complete (though not always arithmetically accurate) ledger of the earnings he collected by title from the time he left the army until 1936, just a few years before his death. You can see the ledger at the University of South Carolina’s digital collections website here.
I was able to convert the record of the dollars he actually made to 2012 dollars using, appropriately, a website called "Westegg" (West Egg is the setting of the novel The Great Gatsby). I found that he made over $37K in 1931, or approximately $564K in today's dollars. Not too bad. Of course the movie The Great Gatsby will likely net much, much more than Fitzgerald's tally, but he wasn't exactly a starving artist.
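For the curious, the conversion itself is just a ratio of price indexes. Here's a small R sketch of the idea – the CPI figures are approximate annual averages I'm supplying for illustration, and Westegg's calculator may use slightly different ones:

```r
# Convert 1931 dollars to 2012 dollars using a ratio of CPI values.
# CPI-U annual averages (approximate): 1931 ~ 15.2, 2012 ~ 229.6.
cpi_1931 <- 15.2
cpi_2012 <- 229.6

earnings_1931 <- 37000   # Fitzgerald's ~$37K ledger total for 1931
earnings_2012 <- earnings_1931 * (cpi_2012 / cpi_1931)

round(earnings_2012)     # ~559,000 – in the same ballpark as Westegg's $564K
```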
Here are his earnings visualized in three tabs: one showing a history over time, another showing the ledger for each year, and finally a third showing the top titles in terms of income collected by Fitzgerald:
A few notes on the making of this interactive graphic:
- Converting the data from PDF to spreadsheet form was painstaking work. I tried a few methods – pdftoexcel.com, saving the PDF as text and then importing it into Excel – but these really didn't work out too well. In the end, it was copy-paste from PDF to Excel, rearrange the fields into a raw data table, and then double- and triple-check the figures.
- Cross-checking the tallies, I found that in many years Fitzgerald wasn't as good at math as he was at writing fiction. No big surprise there, I suppose. It was funny to see that in one place, he actually blames a bout of bad health for his arithmetic errors.
- Categorizing the titles was a little tricky. For example, he used a category called "Books" in some years and "From Books" in others. I combined these where it seemed to make sense, but his system would probably cause most accountants a good deal of heartburn. You will find "The Beautiful and Damned" in no fewer than four different categories – "Books", "Movies", "English Rights", and "Miscellaneous". This is the way Fitzgerald cataloged the income, so I tried to keep it as true to his record as possible.
Thanks for stopping by! If you’re going to see the movie this weekend, I hope it’s better than what the critics say,
Shan Carter, Kevin Quealy and Joe Ward of The New York Times recently published a thorough analysis of the rise of strikeouts in Major League Baseball. In it, they showed how the number of strikeouts per game has risen along with the number of pitchers per game using two line plots, one for each variable. It's good stuff – you should read it. I especially like the grayed out dots for each team, which give a sense of the team-by-team variation without overwhelming the reader.
I found the summary table for average MLB game stats since 1871 here, and I wondered what this correlation, and other pairings of MLB stats, would look like if they were plotted as connected scatterplots. Connected scatterplots are a visualization form that has been featured at the NYT recently (more about this form of visualization, including a number of examples, in Alberto Cairo's blog post "In praise of connected scatterplots").
Here’s what it looks like, along with a second method show below it, the dual axis line plot:
Effort and Reward
I struggled with connected scatterplots at first. Maybe the engineer in me stubbornly resisted the notion of including time on anything other than the x-axis. But I found that after investing a small but not insignificant amount of time orienting myself to the axes, the connected scatterplot actually became a fun chart to explore. To quote Andy Kirk, my effort was "ultimately rewarded with a worthy amount of insight gained" (Kirk, Data Visualization: a successful design process, p. 26).
The connected scatterplot imparts a sense of travelling a pathway through a terrain of twists and turns, loops, and sudden rises and falls that encode how the two variables changed together. It's a roller coaster ride of sorts, and once you've cracked the cipher, you're out of the turnstiles and on your way.
The Other Method: Dual Axes
You have to admit, though, the dual axis line plots below the connected scatterplot do a fine job as well. In fact, they probably require the reader to invest less time upfront to begin to glean some insight (sorry, no experimental data on that claim). If my feeling is right, it probably has something to do with the fact that we’re more used to seeing changes over time shown from left-to-right. It’s still an abstract way to represent time, it’s just one we’re more familiar with.
Virtues and Vices
The dual axis method has some distinct advantages: if you open up the year slider to show the entire range from 1871 to 2012, you will see what I mean. The connected scatterplot becomes much more difficult to read, but the dual axis line plot does not require any additional effort. You can adjust the slider in the interactive version above, or here's a screen shot:
Additionally, not all pairs of variables render well in the connected scatterplot format, even with the shorter time window of 1981-2012. If one variable basically contains a bunch of random noise, or doesn’t change much at all, the connected scatterplot will look very jumbled, and will be hard to read since all the points will just form clumps. For example, change variable 1 to “Avg Pitcher Age” and change variable 2 to “Batters Faced”. What you get isn’t an exciting journey, it’s a wild goose chase, and you can see why if you take a look at the dual axis plot, which immediately tells the story – two flat lines:
In conclusion, my opinion at this point is that the connected scatterplot is a special case visualization type for showing how two variables change together over time – if it works well, it really works well. If it doesn’t, ditch it for the more all-purpose (and admittedly more utilitarian) dual axis line plot. I guess to go along with the baseball theme, my advice would be to swing for the fences if the pitch is right, otherwise just make contact and get on base.
How I made the connected scatterplot in Tableau
This section will serve as a very brief how-to for making a connected scatterplot in Tableau Public. The key is dragging “Year” to the “Path” mark landing pad.
Here are the steps:
- Drag the first measure you want to use to the “Columns” shelf and the second to the “Rows” shelf
- Convert both from SUMs to Dimensions by clicking the down arrow on each pill and selecting "Dimension" (now you have a basic scatterplot)
- Change the Marks type from “Automatic” to “Line” and drag “Year” to the Path landing pad
- Also drag Year to Label
Of course I used Parameters to allow the reader to control the two variable types, and I also used a dual axis to format the data points, but the above steps do the trick.
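If you'd like to try the same chart type outside of Tableau, here's a rough equivalent in R with ggplot2 – not the method behind my workbook, and the data frame below is made up for illustration:

```r
# A minimal connected scatterplot: points in measure-vs-measure space,
# connected in year order and labeled by year.
library(ggplot2)

# Illustrative values only, not the actual MLB summary table
mlb <- data.frame(
  year       = c(1981, 1991, 2001, 2011),
  pitchers   = c(3.2, 3.6, 4.0, 4.3),   # pitchers per game
  strikeouts = c(5.0, 5.8, 6.7, 7.1)    # strikeouts per game
)

ggplot(mlb, aes(x = pitchers, y = strikeouts)) +
  geom_path() +                          # connects marks in row (year) order
  geom_point() +
  geom_text(aes(label = year), vjust = -1) +
  labs(x = "Pitchers per game", y = "Strikeouts per game")
```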
Here’s a screen shot of the final connected scatterplot sheet that I used:
Thanks for stopping by, and I'd love to know your thoughts on the virtues and/or vices of connected scatterplots,
I had the pleasure of presenting a data visualization workflow to the Boston Predictive Analytics Group with Tanya Cashorali last night (huge thanks to Bocoup for providing the meeting space and John Verostek for organizing the event).
The workflow we presented involves using an R library called pitchRx to scrape pitch data from MLB's PITCHf/x database, which Tanya covered (she also has a write-up on her website sportsdataviz.com), and then connecting Tableau Public to the data set to see what's going on. We mined and visualized pitches thrown by Jonathan Papelbon from the 2008 through 2012 MLB seasons – around 6,000 pitches in total.
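For reference, here's a minimal sketch of what the R side of that workflow might look like, assuming the pitchRx package's scrape() function and the join keys its documentation describes (the dates and file name here are illustrative, not our exact script):

```r
# A rough sketch of the scraping step, not the exact script we presented.
library(pitchRx)
library(dplyr)

# Grab a small date range as a demo (the real pull covered 2008-2012).
dat <- scrape(start = "2012-04-01", end = "2012-04-07")

# Join pitch-level rows to at-bat rows to pick up the pitcher's name,
# then keep only Papelbon's pitches.
papelbon <- dat$pitch %>%
  inner_join(dat$atbat, by = c("num", "gameday_link")) %>%
  filter(pitcher_name == "Jonathan Papelbon")

# Write a flat file for Tableau Public to connect to.
write.csv(papelbon, "papelbon_pitches.csv", row.names = FALSE)
```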
Here’s the dashboard I put together for the group:
This is a first cut at what an exploratory (as opposed to explanatory) dashboard could look like, and I’m not quite sure what all the stories in the data are yet, but here are some tidbits that popped out to me:
- If you want a good chuckle, select just the pitches that resulted in balls and check out the gift Brandon Boggs was handed by the umpire on the 0-1 fastball of his at-bat during the 9th inning. That’s the nature of the game, I suppose.
- Also, filter to just strikes and ponder what Sean Rodriguez was thinking when he swung at the 0-2 pitch in the dirt during the 16th inning. Maybe I'd swing at just about anything in the 16th inning too, so I should be careful about criticizing.
- Next, I was surprised to see that Papelbon actually threw more pitches to left-handed batters (54% of all pitches) over the course of the past 5 seasons. Really, more lefties?
- Lastly, sliders are almost exclusively thrown to right-handed batters (81% of all sliders were pitched to righties). That's a good insight for the scouting report, I'd imagine. I'm guessing baseball geeks will be able to find a ton more here.
The real point here is that there’s room for multiple tools in every data worker’s toolkit. Tanya and I showed how you can combine different tools in a complementary way to get the best results. In this case, R does all the plumbing, and Tableau handles the fixtures and window dressing.
Thanks for stopping by, let me know if you have any feedback about the dashboard, or if you’d like to see the how-to.
…and the only prescription is more data.
Check out these regularly published data and data visualization features and roundups. You’ll feel better in no time.
If you know of other data pills to take, leave a comment!
Today (April 13th) marks the 50th birthday of the 13th World Chess Champion, Garry Kasparov of Russia. Garry wrote a book called "How Life Imitates Chess" that I highly recommend – in it he gives a window into his upbringing and his professional life as a chess player, showing how he not only became the youngest undisputed World Chess Champion in 1985 at age 22, but also maintained the world #1 ranking for 255 months. It's a great read because he talks about how his approach to dominating the chess world can transfer into other arenas of life, including his present struggle as a political activist advocating for true democracy and human rights as chairman of the United Civil Front in Russia.
Here are some of my favorite quotes from the book:
“The virtue of innovation only rarely compensates for the vice of inadequacy.”
“We must all walk a fine line between flexibility and consistency. A strategist must have faith in his strategy and the courage to follow it through and still be open-minded enough to realize when a change of course is required.”
“Questioning yourself must become a habit, one strong enough to surmount the obstacles of overconfidence and dejection.”
When I read this book, I was struck by how similar playing chess is to visualizing data. Both are activities that present us with a myriad of options, strategies, and tactics – some more well-advised than others. There is a highly experiential aspect, where the more one participates in the activity, the more one has a sense of what will work well in a given situation – a way of narrowing the option space.
I wrote more about these similarities in a blog post called "How Data Visualization is Like Chess". It was the most enjoyable post to write, by far. One of my most viewed data visualizations is "The Best Chess Openings", so I know I'm not the only data viz enthusiast who also likes chess.
My son Aaron is a Lego pro. He also likes movies and his iPod. Put those three things together and you get stop-motion Lego movies made with an iPod! Genius. He started uploading his videos recently to his new YouTube channel (yeah, that took mom and dad a while to get used to…), so I thought I'd embed his videos in a Tableau Public dashboard and then show you how it was made:
In Tableau Public, connect to your spreadsheet, make the chart you want to use to change the video link (in this case, I made a simple bar chart showing the views of each video), and make sure you drag the “Embed” dimension field into Detail, like this:
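In case it helps, the "Embed" field is just the video's YouTube embed URL (of the form youtube.com/embed/VIDEO_ID). A hypothetical spreadsheet row might look like this – the column names and values are illustrative, not my actual file:

```
Title,Views,Embed
Brick Heist,245,https://www.youtube.com/embed/VIDEO_ID
```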
Next, make a new dashboard, add your chart, and drag a “Web Page” object where you would like to place it (either “Tiled” or “Floating”, whichever you would prefer). When the “Edit URL” dialog box pops up, just leave it blank and click “OK” for now:
Click one of the bars, and you should see the video load in the Web Page object box. Voila! Data viz and video, all in one.
As for the “Aaron’s LegoTube” sign, I used this fun Chrome Lego Builder page to make that.
On some level, we all know that the data we're using to draw conclusions about the world isn't perfect. We know that there remains some uncertainty about everything the human mind considers. From survey results to clinical studies to bridge engineering, there is always some error involved in the numbers. We tend to neglect this uncertainty, and lead ourselves and others astray as a result.
Case in point: fish labels.
On Fish Labels
When Oceana published the findings of their seafood labeling fraud investigation last month, the results were shocking: “more than 1,200 seafood samples taken from 674 retail outlets in 21 states” yielded a disturbing trend – over 33% of DNA samples didn’t match their label. You can read the news on the Oceana website here.
I first heard about the study while driving to work in Seattle after coming back from the Tapestry Conference in Nashville. At Tapestry, we had discussed uncertainty following Jonathan Corum’s keynote, so the topic was fresh in my mind.
An Inferential Leap
Northwest Public Radio had the following to say about the study: "Seattle and Portland are among the best cities in the country to buy fish that is accurately labeled." On the surface, it made perfect sense. Seattle and Portland are coastal cities with robust fishing industries. Of course they'd be better than cities like Austin or Denver. The article went on to state that the lower rates may be due to "consumer awareness about seafood in Seattle." Flattering.
A Look at the Numbers
For fun, I thought I’d take a deeper look, so I found the full report by Oceana here. Let’s take a look at the report to see what can be said about Seattle and Portland, if anything.
If we just look at the overall percent of samples mislabeled by city, we find Seattle and Portland among the best, along with another famous North American fishing hub, Boston:
Case closed then, right? If this were all we were given, we’d make the same inferences as Northwest Public Radio. But were the cities sampled properly to make this statement?
Samples were taken from three types of retail outlets: grocery stores, restaurants, and sushi venues. Here are the results by city and retail category. Mislabeled samples are red bars, and correctly labeled samples are blue bars:
We can see that sushi venues yielded the poorest results, with over 73% mislabeled across all cities (some of the sushi mislabeling was due to "foreign name translation" – e.g., not all types of fish called "yellowtail" in Japan meet the FDA classification).
But the other thing we notice is that very different amounts of sushi were collected in each city. In fact, no sushi was collected in Boston at all.
Breaking Down the Mix
Here is a breakdown of the mix of each retail category in each city’s sample set (thickness of the bars is proportional to mislabeling – thicker meaning a higher mislabeling rate):
So, relatively few sushi samples were collected in Seattle, Portland, and Boston. Only 16% of the samples in Seattle were sushi, while over 35% of the samples in Southern California were sushi, by comparison.
Oceana didn't follow a stratified sampling plan when they collected their 1,214 samples, and as a result, the overall mislabeling rates from each city really aren't apples-to-apples. This doesn't mean their study is meaningless; it just means that comparing the overall rates between cities isn't all that valid. It would be like comparing average human heights in each city while including way more children in one city's sample set than the others. It's just not fair dice.
Comparing Like to Like
Okay, since we can’t really compare the overall rates, what if we just compare the cities within each retail category: so grocery stores to grocery stores, restaurants to restaurants, and sushi to sushi?
Even though a relatively large number of samples was taken overall, the sample sizes get fairly small when you look at each city/category combination, so we should add error bars to the mislabeling rates. This is a case for the binomial proportion confidence interval. There are a number of different ways to compute this interval, but for now we'll stick with the normal approximation we all learned in college.
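Here's a quick R sketch of that interval – a generic normal-approximation (Wald) calculation with made-up counts, not Oceana's actual numbers:

```r
# Normal-approximation (Wald) confidence interval for a proportion.
binom_ci <- function(x, n, conf = 0.95) {
  p_hat <- x / n
  # Rule-of-thumb validity check for the normal approximation
  if (n * p_hat <= 5 || n * (1 - p_hat) <= 5) {
    warning("normal approximation may not be valid for these counts")
  }
  z <- qnorm(1 - (1 - conf) / 2)   # ~1.96 for a 95% interval
  half <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half, estimate = p_hat, upper = p_hat + half)
}

binom_ci(x = 12, n = 40)   # e.g., 12 mislabeled samples out of 40
# lower ~0.158, estimate 0.30, upper ~0.442
```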
I’ll follow up with a how-to post next. But for now, here is the breakdown of mislabeling rates, with uncertainty taken into account:
This data visualization tells a very different story. Notice that not every city is included in this chart. That’s because in some cases, there weren’t enough samples to satisfy the requirements of the normal approximation (n*p>5 and n*(1-p)>5), so I filtered these cases out of the chart. Kansas City drops out altogether, for example. Not enough samples in KC (19, 9 and 9) to say much of anything about labeling there.
What can we say about the different cities? Here’s what we can (and can’t) say based on a 95% confidence interval (ignoring the difference in the types of fish samples collected at each place):
- No city is better or worse than any other in sushi mislabeling
- Restaurants in Chicago had lower mislabeling than restaurants in Northern California
- Grocery stores in Seattle had lower mislabeling than in California (Southern & Northern) and New York.
So some comparisons can be made between cities, just not all that many. In the end, Seattleites can take consolation in the fact that the fish in their grocery stores is labeled more accurately than in California and New York, and perhaps this is even partly due to their seafood IQ.
Oceana revealed widespread mislabeling of fish in this country – that can’t be denied. But a massive inferential leap was made in reporting the story. Looking at the numbers through the lens of statistics allows us to make more accurate statements about the findings. Yes, this involves much more work than simply taking the overall mislabeled rate and slapping it into a map or bar chart. And yes, uncertainty can be annoying. But it’s really just freshman stats 101.
Embracing uncertainty just may mean the difference between truth and fiction, and we wouldn’t want fishy results, now would we?
This past week the inaugural Tapestry Conference was held in Nashville, Tennessee at the Union Station Hotel. Elissa Fink and Ellie Fields of Tableau, who I’m fortunate to work with, envisioned and organized the event. Around 100 journalists, researchers, bloggers, academics, and practitioners attended the event, which focused on storytelling with data.
Simply put, it was mind-blowing. What transpired there was so incredible, that I feel compelled to share some of the most salient points with you.
An emerging art form
One of the overall themes is that telling stories with data is important, but doing it well isn’t easy. It requires an analyst’s mind, a designer’s eye, a journalist’s pen, and a human’s heart working in harmony to weave a tapestry of insight. When it is done well, it’s beautiful to behold, and we saw a number of examples from the work that was presented.
Key Point #1: Tools are great, but it isn’t about the tools
Keynote speaker Jonathan Corum of the New York Times made it clear from the outset that we shouldn't let technology drive. The focus shouldn't be on the tools, nor should it be on the data, but rather on the audience. Whatever tools you use to tell your story, the audience you design for should walk away feeling enlightened. Jonathan's full slide presentation from his keynote is available online here.
Key Point #2: Home in on the story
It can often be tempting to just quickly connect to data, create an elaborate dashboard, and publish to the web, convinced that your audience will be as enraptured with the result as you are. But, as Seattle Times Data Enterprise Editor Cheryl Phillips stated, “data without a theme is not a story”. She encouraged us to “avoid notebook dump” with data, and focus on the “nutgraf”, or the editorial heart of the story. The story is in the patterns, and as keynote speaker Pat Hanrahan of Stanford reminded us, “showing is not explaining.”
Key Point #3: Context is the key to understanding
Nigel Holmes brilliantly illustrated the power of context by asking two demonstrators to hold a 29-foot string across the front of the room in order to help us understand just how far Bob Beamon jumped in the Olympics in 1968. His point: we more fully grasped the magnitude of Beamon's feat because we saw the distance in relationship to something else – the room we were sitting in (and how far Nigel himself was able to jump). Likewise, an infographic that shows the Queen Mary flipped upright and positioned next to famous buildings like the Empire State Building works because it takes the Queen Mary out of its original context – the harbor – and puts it into a different one – the city. We understand the size of the ship in a whole new way, and we feel enlightened as a result.
Key Point #4: It’s okay to be inspired by other people’s work
Hannah Fairfield of the New York Times talked about the evolution of the connected scatterplot graphics “Oil’s Roller Coaster Ride”, “Driving Shifts into Reverse”, and “Driving Safety, in Fits and Starts”. She related how each of these projects inspired the subsequent version. She also talked about the instant classic “Snow Fall” and how the group that created it was inspired by the book “The Invention of Hugo Cabret”. My take-away was that it’s good to have sources of inspiration, and we should always ask ourselves “what’s next?” The more we observe each other’s work and build off of it, the better we will become.
Key Point #5: Critique should be helpful and constructive
Bryan Connor of The Why Axis talked about how criticism in the field of data visualization has often intimidated newcomers. To prevent this, critics should behave more like investigators than psychics, asking "why" a designer did something a certain way instead of making assumptions and passing judgment haphazardly. The importance of this point can't be overstated: knowing the designer's goal enables the critic to give better criticism. Feedback is important for the development of any field, so data storytellers should learn to both give and receive it well.
Key Point #6: We are narrative seeking creatures
Robert Kosara of Tableau and eagereyes.org introduced a helpful four-quadrant model of data visualization that places "Visual Data Stories" in the top right quadrant, since these pieces both tell a story and offer enough depth for readers to explore stories of their own. Again, reaching this quadrant isn't easy. Kosara effectively challenged his audience to define what it looks like by leaving a conspicuous question mark in the upper right corner. As data visualizers, we are constantly publishing content – if we ask ourselves how we can better explain and at the same time allow the user to more deeply explore the story in the data, we'll be moving in the right direction. Robert also posted some photographs taken at the event here.
Key Point #7: Pictures have incredible power
Cartoonist Scott McCloud brought the art of storytelling with pictures to a whole different level by showing the incredible power of images.
“All pictures are words. All pictures speak. All pictures have something to say.”
We witnessed the amazing ability to convey emotion through facial features, saw how our minds seek to impose a narrative on any two pictures shown side-by-side, and learned the power of drawing outside the box and challenging conventions. Scott's presentation was nothing short of stirring, and I walked away feeling that I could communicate more with images than I ever imagined, including images that tell stories with data.
I feel very fortunate to have been at Tapestry, and I’m thankful for all the hard work the speakers put into preparing their materials. I was happy to be able to get involved as a contributor to the @tapestryconf twitter account, which we will be keeping live going forward. On a personal note, it was a thrill to meet with people in this field that I have been following and interacting with online for a while now. I’ve benefitted greatly from contact with them, and I can only wait in anticipation for next year…