First, apologies for the blog post drought! It was that kind of summer. It’s good to be back, though, and I hope you’ve been well.
Scatterplots are my favorite visualization type, hands down. From my very first interactive data graphic about The Great One to the most recent visualization below on major league pitchers, I’ve learned a great deal from these Cartesian classics over the years. In this post I’ll show you how to make them even better than the standard ones in Tableau.
Recently, Shine Pulikathara published a scatterplot of NFL player heights and weights that included two marginal histograms – one for each axis. I tweeted that I liked it, and Lynn Cherny replied that it’s pretty common to see this kind of thing in R:
@DataRemixed those are pretty common in R plots
— Lynn Cherny (@arnicas) September 14, 2015
She’s right, and it turns out that it’s also a common convention with other statistical graphing platforms, like Matlab and Plotly. It’s called a Scatterplot with Marginal Histograms. While Tableau has scatterplots and histograms as standard chart types, it doesn’t automatically combine them for you into a single view. The good news, though, is that it’s fairly easy to combine them using a dashboard with three sheets. There’s only one small trick to make the charts interact the way you want, which I’ll cover below. If you want to follow along, download 2015pitchingstats.xlsx.
First, here is the finished version, showing pitchers’ “skill” (Earned Run Average, or ERA) and “luck” (Runs Scored by their team, or RS) so far in the 2015 season:
Now, let’s consider the four easy steps to create a scatterplot with marginal histograms:
Step 1: Create the Three Sheets
This part is fairly straightforward – create a scatterplot and two histograms as three separate sheets in the same workbook. To create the scatterplot, drag ERA to Columns, RS to Rows, W% to Color, Player to Label, and then add two Average reference lines, like this:
Next, to create the first histogram, create a new sheet, click on the Measure (say, ERA), click Show Me in the top right, and then choose Histogram. Do the same in another new sheet with RS, but click the Rotate icon in the top icon bar to flip the RS histogram 90°. Notice that two new data fields appear in the Measures area: “ERA (bin)” and “RS (bin)”. Right click to edit these fields and change the “Size of bins” to be 0.25 and hide the axes.
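If it helps to see what a bin size of 0.25 does, here is a minimal Python sketch of the binning idea. It is not how Tableau works internally: it assumes each bin is labeled by its lower edge, and the ERA values are made up for illustration rather than taken from the spreadsheet.

```python
import math
from collections import Counter

BIN_SIZE = 0.25  # same "Size of bins" value used in the post


def bin_start(value, size=BIN_SIZE):
    """Return the lower edge of the bin containing value (assumed labeling)."""
    return math.floor(value / size) * size


# Made-up ERA values, not taken from 2015pitchingstats.xlsx
era_values = [2.13, 2.48, 2.89, 3.05, 3.10, 3.62]

# Counting how many values land in each bin gives the histogram bar heights
histogram = Counter(bin_start(v) for v in era_values)

for edge in sorted(histogram):
    print(f"{edge:.2f} to {edge + BIN_SIZE:.2f}: {histogram[edge]}")
```

Each value ends up with exactly one bin label, which is the property the later steps rely on.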
Step 2: Add the Histogram Bin Dimensions to the Scatterplot Chart Detail
Without this step, you won’t be able to get the sheets to interact together in the dashboard. Go back to the scatterplot sheet you created in Step 1 and drag both “ERA (bin)” and “RS (bin)” to Detail. You should now see these two fields listed in the Marks card area:
Step 3: Add the Three Sheets to a Dashboard
Next, create a new dashboard and add the three sheets you created in Step 1. Aligning the histograms with the scatterplot is the one messy part of this method. Add blanks to the left and right of the ERA histogram, and above and below the RS histogram. Drag the blanks until the extreme bars of the histogram align with the extreme points of the scatterplot:
Step 4: Create Two Highlight Actions
The last step is to get the sheets to interact with each other. There are lots of ways they could potentially interact, but here’s what I’d like to see happen:
- When I hover my mouse cursor over any of the histogram bars, the corresponding circles on the scatterplot highlight
- When I hover my mouse cursor over any of the scatterplot circles, the corresponding histogram bars highlight
To do this, create two new dashboard actions by clicking Dashboard > Actions > Add Action > Highlight, and fill out the dialog boxes as follows:
That’s it! For finishing touches, I added a title, lead-in paragraph, data source and last accessed note, four area annotations to define the four quadrants, and two mark annotations to call out points of interest. I also edited the two Average reference lines to uncheck “Show recalculated line for highlighted or selected data points”. This was strictly a matter of preference, and you may decide not to modify the reference lines in that way.
Here are a couple other variations that don’t involve the binning concept inherent in histograms, and therefore don’t require Step 2 above:
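For readers who like to see the logic spelled out, the interaction from Steps 2 and 4 can be pictured with a small Python sketch. This is only a model of the idea, not Tableau’s implementation: the pitcher names and numbers are hypothetical, the field names `era_bin` and `rs_bin` are my own, and bins are assumed to be labeled by their lower edge.

```python
import math

BIN_SIZE = 0.25


def bin_start(value, size=BIN_SIZE):
    """Lower edge of the bin containing value (assumed labeling)."""
    return math.floor(value / size) * size


# Hypothetical pitchers: (name, ERA, RS)
pitchers = [
    ("Pitcher A", 2.13, 4.60),
    ("Pitcher B", 2.89, 3.30),
    ("Pitcher C", 2.95, 5.10),
]

# Step 2 analogue: every scatterplot mark carries both of its bin labels
marks = [
    {"name": name, "era_bin": bin_start(era), "rs_bin": bin_start(rs)}
    for name, era, rs in pitchers
]


def marks_for_bar(field, bin_value):
    """Hovering a histogram bar: which scatterplot marks share its bin?"""
    return [m["name"] for m in marks if m[field] == bin_value]


def bars_for_mark(name):
    """Hovering a scatterplot mark: which histogram bars share its bins?"""
    mark = next(m for m in marks if m["name"] == name)
    return {"era_bin": mark["era_bin"], "rs_bin": mark["rs_bin"]}


print(marks_for_bar("era_bin", 2.75))
print(bars_for_mark("Pitcher A"))
```

Without the bin labels attached to each mark, there would be no shared field to match on, which is why Step 2 has to come before the highlight actions.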
Scatterplot with Marginal Box-and-Whisker-Plots
Scatterplot with Marginal Hash Lines
Thanks for reading! I hope you found this helpful. Let me know if you have any further tips by leaving a comment. Also, I’m curious, which of the three variations – marginal histograms, box plots, or hash lines – do you prefer?
I spend a lot of my free time with my family in nature, and it occurred to me recently that there is something particularly captivating about data visualizations that resemble nature. It makes sense if you think about it. As a species, we’re awed and inspired by majestic landscapes, and we’re drawn in by the intricate patterns we see in the world around us.
It’s the same with data visualizations. We use words like “enlightening,” “stunning,” and “beautiful” to describe the really good ones. Some people tire of the use of these adjectives, but I think there’s a good reason they have found their application here. They describe our inner experience.
To illustrate the point, I chose five visualizations published to the new Tableau Public “Greatest Hits” gallery and found five corresponding images from nature.
Here is the same content posted to Slideshare:
I’m not suggesting we all go out and start making visualizations that look like sand dunes or tadpoles, as appealing as they might look. And relatively few visualizations will work in “petri dish” form, though one about rapidly growing companies just might be in that small set. I’m just sharing my observation that I tend to stop and look closely at patterns that are familiar.
And we can take an important cue from nature as well. Many of nature’s patterns are both mesmerizing and incredibly informative, like the tree rings that encode important data about the life of a tree. If you can inspire as well as inform, then why wouldn’t you do both? Nature does.
Thanks for stopping by,
William Zinsser, 92, died last week. Zinsser, author of On Writing Well, the classic guide to writing nonfiction, has been an inspiration to writers and aspiring writers since he first published his manual in 1976. Douglas Martin of the New York Times has written an excellent obituary on Zinsser.
1.5 million copies of On Writing Well have been sold to people all around the world who care about “getting the words right”, as Ernest Hemingway put it. I first read On Writing Well before launching this website four years ago, and I referred to it countless times while writing my first book, Communicating Data With Tableau (O’Reilly, 2014). Zinsser’s book taught me to respect sentences and words more than I had before. Or, to put it another way, it made me realize that my writing sucked. It was a harsh realization at the time, but I needed to know that upfront. I’d like to think that my writing sucks a little less thanks to Zinsser.
As I read On Writing Well it struck me that his advice for communicating well with words applies directly to the craft of communicating visually with data. His seven principles in Part I – The Transaction, Simplicity, Clutter, Style, The Audience, Words, and Usage – could be written about visualizing data as well.
Let’s call it On Visualizing Data Well:
1. The Transaction
Zinsser opened his classic book by teaching that “the product that any writer has to sell is not the subject being written about, but who he or she is.” (emphasis mine). The transaction, then, is a personal one – the reader is drawn in by the “enthusiasm of the writer for his field,” and the two most important qualities that result are “humanity and warmth.”
The same is true when someone examines the product created by a data visualizer. The humanity of the visualizer should shine through:
- What does the person creating the visualization think about the topic?
- Why does he or she care about this topic?
- How does he or she feel about it?
- Do you know something more about them, not just their topic?
Take another look at one of the most talked about visualizations of 2014, Periscopic’s animated U.S. Gun Deaths. Regardless of what you think about the design, form or aesthetic of the final product, you can’t help but feel the emotions of those who created it. Is there any doubt about what they’re trying to say, and more importantly, why they’re saying it? Kim Rees and her team brought their own humanity to the transaction:
2. Simplicity
Zinsser taught that “the secret of good writing is to strip every sentence to its cleanest components,” and that there are a “thousand and one adulterants that weaken the strength of a sentence.” Here’s an example he gives of “the clotted language of everyday American commerce:”
“The airline pilot who announces that he is presently anticipating experiencing considerable precipitation wouldn’t think of saying it may rain.”
It’s easy to laugh at this bloated phrase because we see it all the time, and we even fall prey to it ourselves. The same is true when we communicate with data. The lesson is that we shouldn’t overcomplicate the message.
This lesson is often misunderstood to mean that we should dumb down the message, or only choose simplistic messages in the first place. This interpretation is wrong. Just as a writer sometimes seeks to articulate a profound thought, we sometimes seek to show relationships that are complex. That’s okay, and we shouldn’t shrink from that challenge in the name of simplicity. But if there’s a clear way to show it, then we should show it clearly. In the words of Albert Einstein, “everything should be made as simple as possible, but not simpler.”
3. Clutter
Zinsser wrote “writing improves in direct ratio to the number of things we keep out of it that shouldn’t be there.” He opens his third chapter with a funny example from the annals of U.S. history:
“Consider what President Nixon’s aide John Dean accomplished in just one day of testimony on television during the Watergate hearings. The next day everyone in America was saying ‘at this point in time’ instead of ‘now’.”
His admonition is to “examine every word you put on paper.” When working with his students at Yale, Zinsser would “put brackets around every component in a piece of writing that wasn’t doing useful work.” Sound familiar? Edward Tufte’s notion of chartjunk is the same notion. Designers and artists celebrate the white space in their creations. Obviously we shouldn’t remove every pixel, just the ones that aren’t doing any work. The trick is knowing which is which.
4. Style
Next Zinsser addresses the objection that reducing a writing product to its simplest form leaves no room for style. He concedes that “simplicity carried to an extreme might seem to point to a style little more sophisticated than ‘Dick likes Jane’ and ‘See Spot run’.”
In data viz, the corollary to these preschool sentences is the bar chart. Simple, easy to understand, but no flair. We’re in familiar territory. It’s the never-ending “clarity vs. beauty” debate. But it’s a false dichotomy. Clarity and beauty are not mutually exclusive. Of course we can achieve both. Information visualization design firm Accurat does it all the time. Here’s an example of their work:
How can we achieve both simplicity and style? Clarity and beauty? Zinsser’s advice for writers applies to us, too. There’s a reason his chapter on style follows the previous two. A singer with loads of personality who sings out of tune won’t sell records. A carpenter who adds bevels and carvings galore to a chair that doesn’t hold your weight won’t stay in business for long. “This is the problem of writers who set out deliberately to garnish their prose.” Zinsser uses the wood-working analogy to show us the way:
“Extending the metaphor of carpentry, it’s first necessary to be able to saw wood neatly and to drive nails. Later you can bevel the edges or add elegant finials, if that’s your taste. But you can never forget that you are practicing a craft that’s based on certain principles.”
To create a data visualization that is both clear and beautiful, we first must get the raw materials and basic proportions right. Only then can we add what Willard Cope Brinton calls “judicious embellishment of charts”. What’s judicious? Fortunately, as Zinsser puts it, “there is no style store”, and you’ll have to answer that for yourself. Your audience will also have an opinion on the matter.
5. The Audience
Speaking of audience, Zinsser addresses this critical element in the fifth chapter of his masterpiece. We often talk about “knowing your audience” in data viz, and user-centered design in product development. It’s a very popular topic. I even give similar platitudes in the first chapter of my own book.
But Zinsser gives what at first seems like shocking advice on this subject. He says:
“You are writing for yourself. Don’t try to visualize the great mass audience. There is no such audience – every reader is a different person.”
Only write for yourself and don’t even consider who will see your work? Really? He clarifies by differentiating between a mechanical act (“work hard to master the tools”) and a creative act (“the expressing of who you are”). If you lose someone through sloppy workmanship, then it’s your fault. If you lose someone because they don’t like what you have to say, don’t worry. “You are who you are, he is who he is, and either you’ll get along or you won’t.”
In other words, care about your audience’s ability to decipher your message, and get that part right, but don’t care about whether they’ll agree with you or like you. Just say what you need to say based on what you find in the data.
6. Words
Zinsser’s sixth chapter, entitled Words, deals with avoiding “cheap words, made-up words and cliché that have become so pervasive that a writer can hardly help using them.” His advice: “You must fight these phrases or you’ll sound like every hack.”
Do we have clichés in the world of data viz? Yes, we do. We all seek to imitate others in some way. The cliché in any field is just the tacky or ineffective element that people continue to use in spite of the fact that it’s bad, just because others use it. Think Periodic Tables and Subway Maps.
At the Tapestry Conference in 2014, Martin Wattenberg and Fernanda Viegas gave a presentation on genres in data visualization. They explained how we often use a shared language that gives our readers shortcuts to understanding. While this can be good, we are in danger of getting stuck in these genres, which can become formulaic.
Here’s their full presentation:
Wattenberg and Viegas say the key is awareness. Awareness of the elements of the genre or genres we’re in, and those elements to which we really shouldn’t adhere, because they don’t work or are tacky. We have to care enough to examine each element and root it out if it’s cliché, regardless of what our peers might be doing. According to Zinsser, “the only way to avoid it is to care deeply about words.” We also have to care about the data.
7. Usage
Zinsser’s last principle is about determining whether what’s new should be “ushered in” as accepted practice, or whether it should be “thrown out on [its] ear.” For any field to be vibrant and thriving, and for it to be at all fun, it must be fluid and not static. Is data viz fluid, or is it static?
Just as there is “no king to establish the King’s English,” there is no anointed panel to accept or reject new methods or tools in data viz, at least not that I know of. We all get to cast a vote by what we use. One of the chief values that innovators like Accurat and Periscopic bring to the field of data viz is a fresh take on this business of communicating with data. We all get to observe one another’s work, and if we keep the principles in mind, the history books will determine what gets kept and what gets left behind.
I’m confident that we’ll get the pixels right.
Thanks for reading,
Earlier this week I was visiting family in Ventura County, California. I had a nice view of the sunset one evening, and I noticed how much the color palette of the sky changed in seven minute increments. I used an iPhone to snap the pictures, Instant Eyedropper for Windows to pull out the hex codes of the uploaded images, and the website Color-hex.com to create the color squares beneath. Enjoy!
Previous – “Part 1: Gaps Between Data and Reality”
I’m reading “Thinking, Fast and Slow” by Nobel Prize winner Daniel Kahneman – a frighteningly interesting book about cognitive biases and “heuristics” (rules of thumb) in decision making. If you deal with numbers at all and haven’t read it yet, you should. In it he refers to an article by Howard Wainer and Harris L. Zwerling called “Evidence That Smaller Schools Do Not Improve Student Achievement” that talks about kidney cancer rates, of all things.
Kidney cancer is a relatively rare form of cancer, accounting for only ~3% of all adult cancers. If you look at kidney cancer rates by county in the U.S. an interesting pattern emerges, as he describes on page 109 of his book:
“The counties in which the incidence of kidney cancer is lowest are mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.” (p. 109)
What do you make of this? He goes on to list some of the reasons people come up with in an attempt to rationalize this fact: residents of rural counties have access to fresh food, lack of air pollution, etc. Did these explanations come to your mind, too? He then points out the following:
“Now consider the counties in which the incidence of kidney cancer is highest. These ailing counties tend to be mostly rural, sparsely populated, and located in traditionally Republican states in the Midwest, the South, and the West.”
Again, people come up with various theories to explain this fact: rural counties have relatively high poverty rates, high fat diet, lack of access to medication, etc.
But wait – what’s going on here? Rural counties have both the highest and the lowest kidney cancer rates? What gives?
Insensitivity to Sample Size
This is a great example of a bias known as “insensitivity to sample size”. It goes like this: when we deal with data, we don’t take into account sample size when we think about probability. These rural counties have relatively few people, and as such, they are more likely to have either very high or very low incidence rates. Why? Because the variance of the mean is inversely proportional to the sample size. The smaller the sample, the greater the variance (proof).
I found the 2007-2011 kidney cancer rate data and the 2010 population data for each U.S. county, and created this interactive graphic to illustrate the point that Kahneman, Wainer and Zwerling are trying to make:
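To make the sample size effect concrete: for independent observations with variance σ², the variance of their mean is σ²/n, so it shrinks as the sample size n grows. The short Python simulation below is a sketch of that idea, not an analysis of the real county data. The true rate of 1% and the two population sizes are hypothetical values chosen so the effect is easy to see, and every simulated county shares the same underlying rate.

```python
import random
import statistics


def simulated_rates(population, true_rate, n_counties, rng):
    """Simulate the observed rate in n_counties counties of equal population."""
    rates = []
    for _ in range(n_counties):
        # Each resident independently becomes a "case" with probability true_rate
        cases = sum(1 for _ in range(population) if rng.random() < true_rate)
        rates.append(cases / population)
    return rates


rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_RATE = 0.01         # hypothetical rate, identical for every county

small = simulated_rates(100, TRUE_RATE, 200, rng)      # sparsely populated
large = simulated_rates(10_000, TRUE_RATE, 200, rng)   # heavily populated

for label, rates in (("small", small), ("large", large)):
    print(
        f"{label} counties: min={min(rates):.4f} "
        f"max={max(rates):.4f} stdev={statistics.pstdev(rates):.4f}"
    )
```

Even though nothing differs between the counties except their size, the small ones should show both the lowest and the highest observed rates, which is the pattern Kahneman describes.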
Notice a few things in the dashboard above:
- In the choropleth map, the darkest orange (high rates relative to the overall U.S. rate) and the darkest blue (low rates relative to the overall U.S. rate) counties are often right next to each other
- In the scatterplot below the map, the marks form a funnel shape, with less populous counties (to the left) more likely to deviate from the reference line (the overall US rate), and more populous counties, like those containing Chicago, L.A. and New York, more likely to be close to the overall reference line
- If you hover over a county with a small population, you will notice that the average number of cases per year is extremely low – 4 cases or less sometimes. A small deviation – even just 1 or 2 cases – in a subsequent year will shoot a county from the bottom of the list to the top, or vice versa
Where else does “insensitivity to sample size” come up? My colleague Dash Davidson suggested “streaks” in sports, which can often be just a “clustering illusion”. We look at a brief sample of a player’s overall performance and notice temporary periods of greatness. We should expect to see such streaks for even mediocre players. Remember Linsanity? Similarly, small samples make some rich and others poor in the world of gambling. You may have a good day at the tables, but if you keep playing, eventually the house will win. And in investing, “diversification” is nothing more than a strategy to minimize exposure to extreme downside risks of individual securities (think Enron).
Kahneman and his long-time partner Amos Tversky even showed that 84 professional psychologists were subject to this very same bias, so experts are not immune.
Avoiding this Pitfall
So what do we do about it? How do we make sure we don’t fall into the pitfall known as “insensitivity to sample size”?
- Be aware of any sampling involved in the data we are analyzing
- Understand that the smaller the sample size, the more likely we will see a rate or statistic that deviates significantly from the population
- Before forming theories about why a particular sample deviates from the population in some way, first consider that it may just be noise and chance
- Visualize the rate or statistic associated with groups of varying size in a scatterplot. If you see the telltale funnel shape, then you know not to be fooled
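The funnel shape in the last suggestion can also be approximated numerically. The sketch below draws on the common funnel plot idea of control limits around the overall rate, using a normal approximation to the binomial. That approximation is my assumption, and it is rough when the expected number of cases is very small. The overall rate of 16 per 100,000 and the two populations are hypothetical.

```python
import math


def funnel_limits(overall_rate, population, z=1.96):
    """Approximate range of rates expected from chance alone at this population."""
    standard_error = math.sqrt(overall_rate * (1 - overall_rate) / population)
    lower = max(0.0, overall_rate - z * standard_error)
    upper = overall_rate + z * standard_error
    return lower, upper


def outside_funnel(observed_rate, overall_rate, population, z=1.96):
    """True if the observed rate is more extreme than chance would suggest."""
    lower, upper = funnel_limits(overall_rate, population, z)
    return observed_rate < lower or observed_rate > upper


OVERALL = 16 / 100_000   # hypothetical overall rate
doubled = 2 * OVERALL    # a county reporting twice the overall rate

for population in (5_000, 5_000_000):
    lower, upper = funnel_limits(OVERALL, population)
    print(
        f"population {population:>9,}: "
        f"limits {lower * 100_000:.1f} to {upper * 100_000:.1f} per 100,000, "
        f"doubled rate outside funnel: {outside_funnel(doubled, OVERALL, population)}"
    )
```

With these made-up numbers, a rate twice the overall rate is unremarkable in the small county and hard to explain by chance in the large one.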
The point of the original article by Wainer and Zwerling is that smaller schools are apt to yield extreme test scores by virtue of the fact that there aren’t enough students in small schools to “even out” the scores. A random cluster of extremely good (or bad) performers can sway a small school’s scores. At a very big school, yes a few bad results will still affect the overall mean, but not nearly as much.
Here’s another way to think of it: if Daniel Kahneman ever moved to Lost Springs, Wyoming, then half of the town’s population would be Nobel Prize winners. And if you think that moving there would increase your chances of winning the Nobel Prize, or that it’s “in the water” or some other such reason, then you’re suffering from a severe case of insensitivity to sample size.
Do any other examples of this pitfall come to mind? Ever fall into it yourself? Share by leaving a comment below.
Thanks for stopping by,
Happy New Year! In 2015 I’ll be publishing a periodic series of blog posts entitled “Avoiding Data Pitfalls” where I’ll suggest ways to avoid common errors people make when working with data. The pitfalls range from philosophical to technical, and from analytical to visual. I’m familiar with these pitfalls because I’ve fallen into them myself, some of them repeatedly. If I’m the only one that these posts keep out of trouble, then it’ll be worth it.
We fall head first into a pitfall when we fail to remember that a gap exists between our data and reality. Do people really fail to realize this? I see (and make) this mistake quite often. I’m starting with this one because it’s foundational, dealing with the grounds and limits of our knowledge. How does it work?
It works like this: we get some data, and run with it, never stopping to think about where it came from, who collected it, what it tells us, and, importantly, what it doesn’t tell us.
It’s easy when working with data to treat it as reality rather than data collected about reality. Here are some examples:
- It’s not crime, it’s reported crime.
- It’s not the number of meteor strikes, it’s the number of recorded meteor strikes.
- It’s not the outer diameter of a mechanical part, it’s the measured outer diameter.
- It’s not how the public feels about a controversial topic, it’s how survey respondents are willing to say they feel.
- It’s not how many people suffer from a particular disease, it’s how many times a doctor diagnosed people with a particular disease.
You get the picture. This distinction may seem like a technicality, and sometimes it is (the number of home runs Hank Aaron “reportedly” hit?) but it can also be a big deal. Let’s see an example of how missing it can lead us astray:
Example #1: Actual vs. Recorded Earthquakes
Consider earthquakes. The USGS provides an Earthquake Archive Search form that lets visitors obtain a list of historical earthquakes that meet various criteria. A query of earthquakes of magnitude 6.0 and above from 1900 to 2013 yields the following, somewhat alarming, line plot:
Are we really to believe that earthquakes have increased in frequency by this much? Obviously not. The world that measured and collected earthquakes in the early 20th century was very different than the one that did so in the last decade. Comparisons across decades, and even within some decades (the 1960s), aren’t “apples-to-apples” due to the changes in technology.
If we separate the line plot by magnitude and add annotations that describe advances in seismology, we see that the rise is only in the smaller group (magnitude 6.0-6.9), and coincides with dramatic improvements in instrumentation:
It’s safe to say that the rise in recorded earthquakes is primarily due to the improvements in our ability to detect them. There may also be an upward trend in actual earthquakes over this time, but our data doesn’t reveal it to us due to the continual changes in the quality of the measurement system. When it comes to earthquakes, the gap between data and reality is getting smaller. The problem is that the “data-reality gap” is changing over the time period we’re considering. And it’s hard to know for sure exactly how many magnitude 6.0 earthquakes we missed in any particular year.
Let’s look at another example – counting bicycles that cross a bridge.
Example #2: Counting Bicycles
Every day on my way to work I walk across the Fremont Bridge. It’s a bright blue and orange double-leaf bascule bridge that was built in 1917. Since it sits so close to the water, it opens on average 35 times a day, which supposedly makes it the most opened drawbridge in the United States. Here’s what it looks like:
The City of Seattle Department of Transportation has installed two “inductive loops” on the pedestrian/bicycle pathways of the bridge that are designed to count the number of bicycles that cross the bridge in either direction, all day every day. The city also provides hourly counts going back to October 2, 2012 at data.seattle.gov. Downloading this data and visualizing it yields the following timeline:
I showed this data at a recent luncheon of the Puget Sound Research Forum, and asked what the attendees thought of these spikes. I honestly didn’t know what had caused them. A few ideas sprang from the crowd – was it “bike to work day”, really good weather, or maybe there was some big bike race or club event? Notice how each of these ideas is based on the assumption that there actually were more bikes that crossed the bridge on those days.
David Bauer was in the audience and found the answer for us: equipment error. The counters just glitched for a few hours on both days. You can read all the details of these anomalous readings and the correspondence between a local blogger and a city employee at the Seattle Bike Blog. I won’t repeat it here, but suffice it to say measuring things can be tricky. Turns out a low battery was the culprit.
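Spikes like these can be flagged before anyone starts theorizing about them. Here is a minimal Python sketch that marks any day whose total is several times the median. The counts are invented, not the Seattle data, and the factor of 3 is an arbitrary threshold of my own. A flag only says “look closer”; it cannot tell a glitch from a real event.

```python
import statistics


def flag_spikes(daily_counts, factor=3.0):
    """Return the positions of counts greater than factor times the median."""
    typical = statistics.median(daily_counts)
    return [i for i, count in enumerate(daily_counts) if count > factor * typical]


# Invented daily bicycle totals with one suspicious day
daily_counts = [2100, 2350, 1980, 9400, 2200, 2275]

for position in flag_spikes(daily_counts):
    print(f"day {position}: {daily_counts[position]} crossings looks suspicious")
```

The median is used instead of the mean so that the spike itself does not inflate the baseline it is compared against.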
Let’s consider one last example: counting Ebola deaths.
Example #3: Ebola Deaths
This past year, the whole world watched in horror as Ebola ravaged West Africa. It’s still happening, by the way, we’ve just stopped noticing. In any case, the WHO provides data about fatalities in weekly situation reports. I had an interesting discussion on Twitter with Alex McDonnell about this data. In it he referred to errors in the WHO reports.
Errors? About one of the world’s most closely followed topics? From one of the world’s most respected organizations? You bet.
Let’s take a look at a timeline of cumulative deaths from Ebola as reported by WHO and CDC. Notice the drops in cumulative death counts – the handful of times when the lines slope down:
Of course it makes perfect sense: the task of diagnosing disease and ascertaining causes of death in some of the more remote locations, where the equipment and staff are often severely limited, must be incredibly difficult. If you read the WHO situation report you’ll notice that they classify cases as “confirmed”, “probable” and “suspected”. It’s not always so obvious. Here are the criteria:
The WHO and CDC actually do a very good job of speaking clearly about “reported” cases (the Dec 31st WHO situation report includes the word “reported” no less than 61 times).
I don’t bring up this example to criticize the people or organizations involved with fighting and documenting the Ebola outbreak. Far from it. This example merely demonstrates that the gap between data and reality can exist even when the stakes are high. Classifying diseases and deaths in chaotic conditions can be tricky business indeed.
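A cumulative count should never go down, so the downward slopes in a series like this one can be found with a very simple check. The Python sketch below uses invented numbers, not the WHO or CDC figures, and only reports where a cumulative series decreases. Deciding whether a drop is a reclassification or an error still takes a human.

```python
def find_drops(cumulative):
    """Return (position, previous, current) wherever a cumulative series falls."""
    return [
        (i, cumulative[i - 1], cumulative[i])
        for i in range(1, len(cumulative))
        if cumulative[i] < cumulative[i - 1]
    ]


# Invented cumulative totals from successive reports
reported = [100, 140, 185, 170, 210, 260, 255, 300]

for position, previous, current in find_drops(reported):
    print(f"report {position}: total fell from {previous} to {current}")
```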
How to Avoid Confusing Data with Reality
Notice that in these three examples – 1. earthquakes (a dubious trend), 2. bicycle counting (a spike or outlier), and 3. Ebola deaths (a downward slope in a cumulative line plot) – something in the view of the data itself alerted us to a potential “data-reality gap”. Visualizing the data can be one of the best ways to find problems with it.
Earlier in the game, though, it helps to remind ourselves that every data point that exists was collected, stored, accessed, etc., via imperfect processes. The more we know about these processes – the equipment used, the protocol followed, the people involved, the steps they took, their motivations – the better equipped we will be to assess the “data-reality gap”.
Here are seven suggestions to help you avoid confusing data with reality:
- Clearly understand the operational definitions of all metrics
- Draw the data collection steps as a process flow diagram
- Understand the limitations and inaccuracies of each step in the process
- Identify any changes in method or equipment over time
- Seek to understand the motives of the people collecting and reporting. Could there be any biases or incentives involved?
- Visualize the data and investigate any shifts, outliers and trends for possible discrepancies
- Think carefully about data formatting, processing, and transformations (thanks Keith!)
Ultimately, each data collection activity is unique, and there are too many possible sources of error to list them all. These are some typical ones that I’ve come in contact with, and you may have your own suggestions. I’d love to hear them.
At the core of this first “data pitfall” is our attitude toward data. Do we arrogantly or naively see ourselves as experts on a topic as soon as we get our hands on some data, or do we humbly realize that our knowledge is imperfect, and we may not know the full story?
We can’t ever perfectly know the “data-reality gap” because that would require perfect data. What we can do, though, is seek to identify any gaps that may exist, and take that into account when we use data to form our opinions.
Thanks for reading,
Since Kobe Bryant surpassed Michael Jordan in “career points” scored this past Sunday, much has been written about who is better, etc. If a) you’ve heard enough already, or b) you don’t care about sports at all, then you may stop reading now. There is a finer point about “data definitions”, but I’ll leave that for later. First, back to basketball player egos:
There is one thing about this debate that puzzles me: why regular season stats alone are used in the “career” totals. I can understand leaving out the All-Star game stats, as the All-Star game is largely a meaningless pick-up game (and crazy points-fest). But playoffs? NBA basketball is all about the playoffs. That’s where the games really matter, the real drama unfolds, and Hall-of-Fame “careers” are made.
Bryant and Jordan have both had considerable success in the playoffs, but somehow the points they scored during these crucial games don’t count to their career totals? I don’t get it. If there’s a compelling reason to leave out playoff stats from “career totals”, I’m unaware of it.
And if you take into consideration playoff point totals when tallying career stats, Bryant hasn’t passed Jordan quite yet:
Of course you can make the argument that this just postpones the inevitable by a few weeks, as Bryant will soon surpass Jordan in total points scored including the playoffs. So who cares?
To me, it just shows how we can get wrapped up in debates about numbers without stopping to consider the “data definitions” – exactly what are we comparing? How is this data collected, what is included and what is not included? Does it even make sense?
Did you know all those dramatic game-winning shots these two players made in crunch time aren’t even included in their career totals?
Like this one:
And these ones:
I bet you didn’t. If you did, can you defend it? I can’t.
At some point in the next month or so, Kobe will pass Jordan in the total number of points he has scored in the regular season and the playoffs combined. There will be no hugging at center court, no interviews, no headlines or blog posts about the stats. That’s not such a bad thing, though, I guess.
As for who is the better scorer? Jordan scored at a higher rate, but took two extended breaks during his career. Bryant took no such breaks, and had only one considerable setback due to injury, so he racked up points at a slower, albeit unabated, pace. And for all the flak Bryant has taken for being a ball hog, try comparing his career Assists with Jordan’s. He passed Jordan in assists two years ago.
Thanks for indulging me,
In June of this year I published my first book with O’Reilly Media called “Communicating Data with Tableau”. It has been great to hear from readers around the world, and I’m grateful for the reviews that have been published.
Here is a sampler that includes the entirety of Chapter 1 (entitled “Communicating Data”) in pdf format, freely available for anyone to download and share (click to download the pdf):
In Chapter 1, I share my thoughts on the concept of creating and sharing data visualizations as a particular form of the communication process. Thinking of it that way has been very helpful for me, because it frames the activity as an attempt to affect the minds of others.
Just like other types of communication, such as speech and even body language, its success depends on many factors, some obvious and some subtle. It is subject to problems of various kinds, as identified by the fathers of information theory, Shannon and Weaver. And critically, communicating data touches on both the rational and the emotional.
Ultimately, I outline 6 Principles of Communicating Data, for which I have also created a handy online checklist. These principles have helped me focus my efforts and avoid some common pitfalls that I have fallen into in the past, such as failing to identify and know my audience, or only considering a subset of the relevant data.
Thanks for visiting, and please be sure to send me a note to tell me what you think about these resources, and what you would add/edit/remove.
Vox published an interesting post today called “America has stopped paying attention, but Ebola is still ravaging Sierra Leone”. It made me wonder whether it’s just America that has stopped paying attention, or if in fact other parts of the world have moved on as well. I turned to Google Trends to look at relative search popularity of the string ‘ebola’ over time for various countries. Here’s what I found:
A quick glance suggests that other unaffected countries have moved on as well. At least these nine have. More specifically, people in these other countries are also using Google to search for the English word ‘ebola’ far less frequently than they were earlier in the year, especially during early to mid October, when the hype hit its peak (except in India, where search was highest in early August).
One interesting country to look at is Japan. The dashboard above seems to indicate that no one in Japan cared at all about ebola, all year long. Is that true? No, it’s not true, and it highlights one of the limitations of using this type of data to answer this question. If you do a similar Google Trends search for エボラ, the Japanese word for ‘ebola’, here’s what you get:
It helps to understand exactly what your data is telling you, and what it isn’t telling you.
Ebola is the same word in English, Spanish, Portuguese, German, Italian, and Hindi, so the other countries probably don’t have a similar problem. I wasn’t able to find the Pashto or Dari (Afghanistan) translations of the word “ebola”, but suffice it to say that Google search trends are far less effective a proxy for popular interest in Afghanistan, where only 5.9% of the population uses the internet, according to The World Bank.
Finally, if we compare relative Google search popularity for ‘ebola’ in heavily affected countries like Liberia, Sierra Leone, and Guinea, here’s what we find:
- Get the raw data Excel file here
- Since Google Trends only allowed me to compare 5 countries at a time, I had to run two separate queries, with United States included in both queries to maintain a common reference point of comparison.
- After running the queries, I downloaded the data as a CSV by clicking on the gear icon in the top right corner of the Google Trends page.
- I combined both CSV downloads into one spreadsheet and used the Tableau Reshaper Excel Add-in (Windows only) to convert the resulting cross-tab table into a long list of data values – a single row for each week for each country.
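If you don’t have the Excel add-in handy, the combine-and-reshape steps above can be roughly sketched in Python. This is only an illustration of the idea, not the workflow I actually used: the column names, the two tiny inline tables, and every value in them are made up for the example, and the helper functions `read_wide` and `combine_and_reshape` are hypothetical names.

```python
import csv
import io

# Hypothetical stand-ins for the two Google Trends CSV downloads.
# Column names and values are invented for illustration only.
QUERY_A = """Week,United States,Japan
2014-10-05,80,0
2014-10-12,100,1
"""
QUERY_B = """Week,United States,India
2014-10-05,80,12
2014-10-12,100,9
"""


def read_wide(text):
    """Parse a cross-tab CSV into {week: {country: value}}."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["Week"]: {k: int(v) for k, v in row.items() if k != "Week"}
        for row in reader
    }


def combine_and_reshape(*tables):
    """Merge wide tables on week, then emit one (week, country, value) row per cell."""
    merged = {}
    for table in tables:
        for week, values in table.items():
            week_row = merged.setdefault(week, {})
            for country, value in values.items():
                # Keep the first value seen, so the shared reference
                # country (United States) is not duplicated.
                week_row.setdefault(country, value)
    return [
        (week, country, value)
        for week in sorted(merged)
        for country, value in merged[week].items()
    ]


long_rows = combine_and_reshape(read_wide(QUERY_A), read_wide(QUERY_B))
for row in long_rows:
    print(row)
```

The result is the same shape described above: a single row for each week for each country, which is the layout Tableau works with most easily.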
As always, let me know if you have any thoughts about this topic, my approach to understanding it, or the visualization I created to communicate my findings.
Thanks for stopping by,