First, apologies for the blog post drought! It was that kind of summer. It’s good to be back, though, and I hope you’ve been well.
Scatterplots are my favorite visualization type, hands down. From my very first interactive data graphic about The Great One to the most recent visualization below on major league pitchers, I’ve learned a great deal from these Cartesian classics over the years. In this post I’ll show you how to make them even better than the standard ones in Tableau.
Recently, Shine Pulikathara published a scatterplot of NFL player heights and weights that included two marginal histograms – one for each axis. I tweeted that I liked it, and Lynn Cherny replied that it’s pretty common to see this kind of thing in R:
@DataRemixed those are pretty common in R plots 🙂
— Lynn Cherny (@arnicas) September 14, 2015
She’s right, and it turns out that it’s also a common convention with other statistical graphing platforms, like MATLAB and Plotly. It’s called a Scatterplot with Marginal Histograms. While Tableau has scatterplots and histograms as standard chart types, it doesn’t automatically combine them for you into a single view. The good news, though, is that it’s fairly easy to combine them using a dashboard with three sheets. There’s only one small trick to make the charts interact the way you want, which I’ll cover below. If you want to follow along, download 2015pitchingstats.xlsx.
First, here is the finished version, showing pitchers’ “skill” (Earned Run Average, or ERA) and “luck” (Runs Scored by their team, or RS) so far in the 2015 season:
Now, let’s consider the four easy steps to create a scatterplot with marginal histograms:
Step 1: Create the Three Sheets
This part is fairly straightforward – create a scatterplot and two histograms as three separate sheets in the same workbook. To create the scatterplot, drag ERA to Columns, RS to Rows, W% to Color, Player to Label, and then add two Average reference lines, like this:
Next, to create the first histogram, create a new sheet, click on the Measure (say, ERA), click Show Me in the top right, and then choose Histogram. Do the same in another new sheet with RS, but click the Rotate icon in the top icon bar to flip the RS histogram 90°. Notice that two new data fields appear in the Dimensions area: “ERA (bin)” and “RS (bin)”. Right-click each of these fields, choose Edit, and change the “Size of bins” to 0.25. Then hide the axes on both histogram sheets.
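If it helps to see what a 0.25 bin size means outside of Tableau, here is a minimal Python sketch of the idea. It assumes each bin is identified by its lower edge and holds values from that edge up to, but not including, the next one. The function names and the ERA values are made up for illustration and are not taken from 2015pitchingstats.xlsx.

```python
import math
from collections import Counter


def bin_floor(value, size=0.25):
    """Return the lower edge of the bin that contains value."""
    return math.floor(value / size) * size


def histogram(values, size=0.25):
    """Count how many values fall into each bin, keyed by lower edge."""
    return Counter(bin_floor(v, size) for v in values)


# Hypothetical ERA values, for illustration only
eras = [2.10, 2.48, 2.50, 2.74, 3.01, 3.20, 3.24]
counts = histogram(eras)

# Print a rough text histogram, one row per non-empty bin
for edge in sorted(counts):
    print(f"{edge:.2f}: {'#' * counts[edge]}")
```

A smaller bin size gives more, narrower bars; a larger one smooths the distribution at the cost of detail.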
Step 2: Add the Histogram Bin Dimensions to the Scatterplot Chart Detail
Without this step, you won’t be able to get the sheets to interact together in the dashboard. Go back to the scatterplot sheet you created in Step 1 and drag both “ERA (bin)” and “RS (bin)” to Detail. You should now see these two fields listed in the Marks card area:
Step 3: Add the Three Sheets to a Dashboard
Next, create a new dashboard and add the three sheets you created in Step 1. Aligning the histograms with the scatterplot is the one messy part of this method. Add blanks to the left and right of the ERA histogram, and above and below the RS histogram. Drag the blanks until the extreme bars of the histogram align with the extreme points of the scatterplot:
Step 4: Create Two Highlight Actions
The last step is to get the sheets to interact with each other. There are lots of ways they could potentially interact, but here’s what I’d like to see happen:
- When I hover my mouse cursor over any of the histogram bars, the corresponding circles on the scatterplot highlight
- When I hover my mouse cursor over any of the scatterplot circles, the corresponding histogram bars highlight
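The matching behind those two behaviors can be sketched in a few lines of Python. This is only a conceptual model, not how Tableau implements highlighting: it assumes every scatterplot mark carries its ERA bin and RS bin (which is what Step 2 arranges), and that a highlight simply matches marks that share a bin value. The `Pitcher` class, the function names, and the sample numbers are all hypothetical.

```python
import math
from dataclasses import dataclass

BIN_SIZE = 0.25  # same bin size used for both histograms


def bin_floor(value, size=BIN_SIZE):
    """Return the lower edge of the bin that contains value."""
    return math.floor(value / size) * size


@dataclass(frozen=True)
class Pitcher:
    name: str
    era: float
    rs: float

    @property
    def era_bin(self):
        return bin_floor(self.era)

    @property
    def rs_bin(self):
        return bin_floor(self.rs)


def points_for_bar(pitchers, field, bin_edge):
    """Hovering a histogram bar: names of the scatterplot marks in that bin."""
    return [p.name for p in pitchers if getattr(p, field) == bin_edge]


def bars_for_point(pitcher):
    """Hovering a scatterplot mark: the one bar to highlight in each histogram."""
    return {"era_bin": pitcher.era_bin, "rs_bin": pitcher.rs_bin}


# Hypothetical pitchers, for illustration only
pitchers = [
    Pitcher("A", 2.10, 4.60),
    Pitcher("B", 2.20, 3.10),
    Pitcher("C", 3.40, 4.70),
]

print(points_for_bar(pitchers, "era_bin", 2.0))
print(bars_for_point(pitchers[2]))
```

Without the bin values attached to each point, there would be nothing for the two sheets to match on, which is why Step 2 matters.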
To do this, create two new dashboard actions by clicking Dashboard > Actions > Add Action > Highlight, and fill out the dialog boxes as follows:
That’s it! For finishing touches, I added a title, lead-in paragraph, data source and last accessed note, four area annotations to define the four quadrants, and two mark annotations to call out points of interest. I also edited the two Average reference lines to uncheck “Show recalculated line for highlighted or selected data points”. This was strictly a matter of preference, and you may decide not to modify the reference lines in that way.
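The two Average reference lines are what divide the scatterplot into four quadrants. As a rough sketch of that split, the Python below averages ERA and RS and labels each point by which side of each average it falls on. The label wording and the sample values are my own assumptions, not the annotations used in the dashboard.

```python
from statistics import mean


def quadrant(era, rs, avg_era, avg_rs):
    """Describe which side of each average line a point falls on."""
    skill = "below-average ERA" if era < avg_era else "average-or-higher ERA"
    luck = "above-average RS" if rs > avg_rs else "average-or-lower RS"
    return f"{skill}, {luck}"


# Hypothetical (name, ERA, RS) rows, for illustration only
rows = [("A", 2.0, 5.0), ("B", 2.0, 3.0), ("C", 4.0, 5.0), ("D", 4.0, 3.0)]

avg_era = mean(era for _, era, _ in rows)
avg_rs = mean(rs for _, _, rs in rows)

labels = {name: quadrant(era, rs, avg_era, avg_rs) for name, era, rs in rows}
for name, label in labels.items():
    print(name, "->", label)
```

Remember that a lower ERA is better, so the "skilled and lucky" corner is low ERA with high RS.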
Here are a couple of other variations that don’t involve the binning concept inherent in histograms, and therefore don’t require Step 2 above:
Scatterplot with Marginal Box-and-Whisker-Plots
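For readers less familiar with what a box-and-whisker plot encodes, here is a small Python sketch of the usual summary: the three quartiles, whiskers that extend to the most extreme values within 1.5 times the interquartile range, and outliers beyond that. The quartile method used here is one common convention and may not match Tableau's calculation exactly, and the data values are invented.

```python
from statistics import quantiles


def box_summary(values):
    """Quartiles, whisker ends, and outliers using a 1.5 * IQR rule."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical values with one obvious outlier
summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Because every value is still drawn as its own mark, the outlier stays visible and selectable instead of disappearing into a one-count histogram bar.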
Scatterplot with Marginal Hash Lines
Thanks for reading! I hope you found this helpful. Let me know if you have any further tips by leaving a comment. Also, I’m curious, which of the three variations – marginal histograms, box plots, or hash lines – do you prefer?
A great post Ben. I can see some immediate use cases for this at work. While the bar-code (i.e. hash lines) version looks cool, I definitely prefer the histogram and box/whisker versions as they encode the data in a more digestible way – the clustering is more obvious. And for data sets that may be highly skewed, I can see where the box/whisker model would be preferable in that you could see and interact with every value, whereas the histogram might make it hard to see and click on the outliers.
Thanks Michael! I agree, the ability to interact with every value, combined with the additional information about quartiles and outliers, makes a good case for the box-and-whisker plot as the best of the three. I suppose the histograms, while not letting you see each value on the margins, do let you see the highlighted groupings within each bin.
I always love the posts regarding Baseball and Sports.
Ben, Thank you for sharing!
Ben, great article!
I created a third variant with marginal histogram+boxplot
Based on an approach described here
Very interesting, thanks for sharing Alexander! At first, the visual noise of the technique bothered me somewhat, but after looking at it more, the transparency does allow the reader to see all of the marks. If there were only one histogram, I’d definitely put the histogram below the x-axis labels, or in the white space above the histogram to avoid the partial occlusion, but with many rows of histograms, as you’ve shown, overlaying the box plot does add value. You can see where the 25th, 50th, and 75th percentiles are located.
Those visual properties can be fine tuned to make the boxplot the least obstructive. Here is one of my experiments:
Great post, I found it a bit different compared to my case and thought you might have a solution for me as my graphs are dynamic and I just want to synchronize the x-axis of two plots.
Here is the link to my post on the Tableau forum. I would really appreciate your comments on it.