Any time you find yourself with a data set that has more than a few columns, such as the 2012 Major League Baseball regular season stats sheet, you have a number of options for data exploration and discovery. The data set lists all “qualifying” players, their team and position, and their games played, at bats, hits, home runs, doubles, RBIs, etc. for the 2012 regular season. You can find the table online here.
Option #1: Table Madness
You could just sort and filter the table, and maybe even color-code the cells with some fancy conditional formatting in Excel (Home > Conditional Formatting > Color Scales > Red – White – Blue Color Scale). It's an interesting approach, but it doesn’t take advantage of the human brain’s superior capacity to compare values by position or size. I’ll save it for all those dense quarterly financial spreadsheets. Here it is, for posterity:
Option #2: “O-F-A-T”
One option (called “O-F-A-T”, or “One Factor at a Time”) is to view each parameter, say, Homeruns, all by itself in a basic chart like a histogram. Rinse & repeat for all the other parameters. As nice as that is, you won’t learn anything about potential correlations between variables; you’ll only know each variable in isolation:
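To make the one-factor-at-a-time idea concrete, here is a minimal sketch in Python of what a histogram does with a single variable: it groups the values into equal-width bins and counts them. The home run totals below are hypothetical numbers invented for illustration, not values from the 2012 stats sheet, and the `histogram` helper is my own naming.

```python
from collections import Counter

# Hypothetical home run totals, invented purely for illustration
# (NOT taken from the 2012 stats sheet).
home_runs = [3, 8, 12, 15, 17, 22, 24, 25, 31, 44]


def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


bin_width = 10
bins = histogram(home_runs, bin_width)

# Print a crude text histogram, one row per bin.
for start, count in bins.items():
    end = start + bin_width - 1
    print(f"{start:>2}-{end:<2} {'#' * count}")
```

You would repeat this for every other column, which is exactly the limitation: each run describes one variable and says nothing about how two variables move together.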
Option #3: Small Multiples
You could make a whole panel of histograms, one per parameter, which would really just be a more efficient version of Option #2. Or, as demonstrated in Chapter 6 of Visualize This, you could plot each variable against every other variable in a scatterplot matrix (one form of small multiples chart). Many prefer this option (created here using the R function “pairs()”, with fitted LOESS curves), as it allows you to quickly scan for strong correlations between two variables (in this case, I reduced the number of variables in the matrix from 16 to 10 for readability):
A quick scan shows that Homeruns and RBIs are closely and positively correlated (as are At bats and Runs). Can you spot the negative correlations? It isn't hard to notice that Homeruns and Stolen Bases seem to be negatively correlated, right? To a fan of the game of baseball, none of these correlations will be particularly surprising, but the power of this approach is that many, many correlations can be identified, at least roughly, in a very short amount of time. Do a quick Google Image search for “small multiples” and you’ll find other great examples of this powerful visualization. Here’s a recent piece by the New York Times showing drought conditions in the US since the late 19th century. Bottom line: I love small multiples.
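If you want a number to go with the eyeball test, the scan of a scatterplot matrix can be approximated by computing a Pearson correlation for every pair of columns. The sketch below is in Python rather than R and is only an illustration: the five stat lines are made-up values that I constructed to mimic the patterns described above, so they are not evidence of anything, and the `pearson` and `correlation_pairs` helpers are my own naming.

```python
from itertools import combinations
from math import sqrt

# Hypothetical stat lines for five made-up players, constructed to mimic
# the patterns described in the text (NOT real 2012 data).
stats = {
    "HR": [5, 12, 20, 30, 41],
    "RBI": [40, 55, 70, 95, 110],
    "SB": [35, 20, 12, 5, 2],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_pairs(table):
    """Correlation for every unordered pair of columns, like one cell per panel."""
    return {
        (a, b): pearson(table[a], table[b])
        for a, b in combinations(table, 2)
    }


pairs = correlation_pairs(stats)

# List the pairs from most negative to most positive.
for (a, b), r in sorted(pairs.items(), key=lambda item: item[1]):
    print(f"{a:>3} vs {b:<3} r = {r:+.2f}")
```

A single coefficient only captures linear association, so it complements the scatterplot matrix rather than replacing it; the plots still show curvature and outliers that the number hides.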
Option #4: Interactive Visualizations
Lastly, you could create an interactive data discovery dashboard (created here using Tableau Public), using Parameters to allow the explorer of the data set to dynamically assign any of the 16 variables to the x-axis, the y-axis, the color palette or the circle size:
This effectively allows the user to compare any 4 of the 16 variables at once. Of course, since position is easier to discern than either color or area, the true comparison is still between the variable on the x-axis and the variable on the y-axis. Still, the other two variables are “along for the ride”, aiding a secondary analysis. The interactive environment also lets the user hover the mouse over any data point to learn more about the datum, or filter the view (in this case by either position or team).
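The underlying idea is independent of any particular tool: the viewer picks four column names, and each one is bound to a visual channel. Here is a rough Python sketch of that mapping. It is not how Tableau implements Parameters; the `encode` function, the column names, and the two player rows are all hypothetical and exist only to show the shape of the idea.

```python
# Hypothetical rows for two made-up players (NOT real 2012 data).
players = [
    {"Player": "Player A", "Pos": "1B", "HR": 30, "RBI": 95, "SB": 2, "H": 160},
    {"Player": "Player B", "Pos": "CF", "HR": 8, "RBI": 45, "SB": 35, "H": 150},
]


def encode(rows, x, y, color, size):
    """Bind four chosen column names to the x, y, color, and size channels."""
    return [
        {
            "label": row["Player"],  # what a hover tooltip might show
            "x": row[x],
            "y": row[y],
            "color": row[color],
            "size": row[size],
        }
        for row in rows
    ]


# One possible assignment; swapping the arguments re-draws the same data
# with different variables on each channel.
points = encode(players, x="HR", y="RBI", color="Pos", size="SB")
for point in points:
    print(point)
```

Changing the view is just a matter of calling `encode` with different column names, which is what makes this kind of dashboard feel exploratory.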
So, Which Do You Prefer?
What I’d really like to know is: between Option #3 (small multiples) and Option #4 (interactive visualizations), which do you prefer, if any, and why? Do you prefer seeing it all at once at a high level, or do you want the ability to drill down, filter and customize a more limited but intricate view? Or are these totally different tools for totally different purposes that shouldn’t really be contrasted?
Thanks for stopping by!