# Raining on Visual Statistics’ Parade?

*This blog post accompanies a paper to be presented at IEEE VIS 2023, “Fitting Bell Curves to Data Distributions using Visualization” written by Eric Newburger, Michael Correll, and Niklas Elmqvist. For more details, **read the paper**.*

I’m increasingly using these paper explainer posts as ways to riff on the general topic of the paper, explore my own neuroses, and otherwise do a bunch of things that are not actually explaining what the paper is about, so apologies to my co-authors here (although Eric has a post about an earlier version of this project if you want a more traditional explainer). But I wanted to use the opportunity of this paper’s presentation to talk about a concept that has been more or less at the center of my work in visualization for over a decade now: the notion of **visual statistics**. My question is: can the visual statistics project weather the storm clouds on the horizon?

Here’s how the usual story goes:

Here’s a line chart of average global surface temperature for the past 150 years or so. What do you see when you look at it? I think something close to “temperature, overall, is steadily increasing” or “there’s an upward trend in temperature” or “while, year to year, temperature can go up and down, over the long term things are warming” etc. etc. I can ask you questions like “how noisy are the data?” or “are there any outliers?” or “which decade has the lowest average temperature” and get answers, or at least educated guesses. But you (probably) didn’t get out a ruler and a calculator before you started answering. You just looked at the chart, and made some statements about the *statistical* properties of the data from the *visual* properties of the chart. That’s part of what I mean by **visual statistics**: the capacity to answer statistical questions visually. It’s also an *orientation* towards thinking about visualization design: a mindfulness of what statistical tasks a visualization supports, and occasionally intentional design choices to make some of these tasks easier (for instance, if I thought you were going to do a terrible job estimating the overall trend, I might plop a trend line in there to help you).

At first blush, this might not seem like a big deal: “Oh, people can look at charts and can answer questions about the data, I already knew that.” But I think the visual statistics project is actually pretty radical. For one, it promises a democratization of analytics: very few people have in-depth training in statistics, but lots of people can interpret charts and graphs. For another, it means that a lot of the foundational empirical work on visualization (which often treats efficiency/utility in terms of extracting individual values from charts, rather than aggregate statistics or models) needs to be extended or rethought. If we can accomplish all of this, and clever visualization design and some relatively lightweight instruction are all that it takes for people to be pretty good visual statisticians, then we’ve created a fundamentally different world and relationship between people and data.

But there are a few areas of pushback here. The first is whether we are *good enough* visual statisticians. We “eyeball” or “sanity check” our visualizations, which seems, even at the level of the linguistic metaphor used, a far less precise process than, say, *calculating*. Would I trust my doctor if they just looked at a scatterplot of some clinical trial and said “well, looks like the treatment does better than the placebo, let’s operate”? The second is just how darn easy it is to be *fooled* by charts. Visualizations can look like they are showing a statistical pattern that totally falls apart under scrutiny. And those estimates we are making are likely driven by “visual proxies”: visual features or patterns that *can be* associated with the statistics we care about, but don’t *have* to be. Here’s an example that kept me up at night, adapted from Richard McElreath:

Here are two scatterplots with roughly the same positive trend. Which one has a stronger correlation/goodness of fit with respect to this trend?

Visually, it looks like the plot on the left, right? The points are closer to the trend line, more tightly packed and, while there are a couple of outliers compared to the right’s uniform mass of points, there are also more points that are almost directly on the trend line.

The answer is that they both have the same r value (0.10), down to a couple of decimal places (in fact, the one on the right is just a rank order transform of the one on the left). But we assume that the *visual dispersion* of points is sufficient information to determine the *statistical variability* of points. And, in this case (where points are dispersed in both *x* and *y*), it’s not. I made a notebook that was essentially me trying to come to grips with this fact (Steve Haroz then turned around and made a better version). But it’s just one of many examples you could generate, with whatever combination of real or artificial data you want, where what we see (or estimate) doesn’t match the alleged statistical quantity of interest.
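If you want to generate an example like this yourself, here is a minimal sketch. To be clear, this is not the construction from my notebook or from McElreath: the helper `y_with_exact_r`, the target correlation of 0.5, and the "mostly tight plus a few outliers" noise recipe are all assumptions made for illustration. The idea is to force two point clouds to share exactly the same Pearson r while leaving most of the points in one cloud much closer to the trend line.

```python
import numpy as np

def y_with_exact_r(x, noise, r):
    """Build y whose sample Pearson correlation with x is exactly r (hypothetical helper)."""
    xs = (x - x.mean()) / x.std()            # standardize x
    e = noise - noise.mean()                 # center the noise
    e = e - (e @ xs) / (xs @ xs) * xs        # remove the part of the noise aligned with x
    e = e / e.std()                          # rescale noise to unit variance
    return r * xs + np.sqrt(1.0 - r**2) * e

def median_abs_residual(x, y, r):
    """Typical vertical distance of a point from the trend line y = r * standardized x."""
    xs = (x - x.mean()) / x.std()
    return np.median(np.abs(y - r * xs))

rng = np.random.default_rng(0)
n = 400
target_r = 0.5                               # arbitrary illustrative value

x = rng.normal(size=n)

# Cloud A: evenly spread Gaussian noise (a "uniform mass" of points).
even_noise = rng.normal(size=n)

# Cloud B: mostly small noise, with about 5% of points pushed far out.
spiky_noise = rng.normal(size=n)
outliers = rng.random(n) < 0.05
spiky_noise[outliers] *= 10.0

y_even = y_with_exact_r(x, even_noise, target_r)
y_spiky = y_with_exact_r(x, spiky_noise, target_r)

r_even = np.corrcoef(x, y_even)[0, 1]
r_spiky = np.corrcoef(x, y_spiky)[0, 1]
tightness_even = median_abs_residual(x, y_even, target_r)
tightness_spiky = median_abs_residual(x, y_spiky, target_r)

print(f"r (even cloud):  {r_even:.4f}, median |residual|: {tightness_even:.3f}")
print(f"r (spiky cloud): {r_spiky:.4f}, median |residual|: {tightness_spiky:.3f}")
```

Both clouds report the same r by construction, but the typical point in the outlier-heavy cloud sits closer to the line, which is roughly the visual proxy we reach for when we judge "tightness" by eye.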

My first reaction to all this is an immediate *tu quoque*: well, it’s not like *statistical practice* doesn’t fall victim to those issues either. What we want to learn from our data is not just some arbitrary set of statistical values: we want to discover *phenomena about the world*. If the visual dispersion of points isn’t always a great proxy for correlation, it’s not like r values are always great proxies for what we actually care about (like, say, a predictable causal connection between two variables), either. One of my favorite papers with the excellent title “*The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research*” has a whole chunk about how statistical variables like p-values don’t capture the useful properties we want them to, but are often treated like magic incantations. We can’t visually estimate p-values? Good! Who says I want to? But this is all just an immediate bristly defensive reaction; I think a more productive lens is to look at *alignment*. Okay, visual and statistical judgments don’t align? Sure, but *when* don’t they align, *where* do these processes break down, and *why*? Which of these visual judgments are “good enough for government work”: reliable in enough cases, for enough data sets, and with enough precision, that we can use them in the frequent scenarios where we don’t have a statistician at the ready?

Hopefully now, all these paragraphs later, you have an idea of the baggage I was carrying around for the actual paper I want to discuss today. I think it’s a pretty simple paper: we gave people a bunch of univariate visualizations like histograms or dot plots, and then asked them to fit Gaussian curves to the data by dragging a slider around to adjust the mean and standard deviation. That’s the *what*, but I think the *why* (which reviewers kept asking us to cut for being a distraction, but I really think it’s key to why you’d want to even run these kinds of studies) is worth discussing: we wanted to build some trust in visual statistics, for basic and then increasingly complicated statistics until, eventually, we know when and how to intervene as designers to be confident that our target audience is pulling out “good enough” conclusions from the charts we give them. You can aim for a shining future in which you don’t need to be an over-educated weirdo to accurately assess statistical claims made in the media (or even in scientific papers), or for the more modest goal of figuring out when and how to augment or replace *textual statistical* arguments with *visual statistical* ones (and yes, both statistics and visualizations are implicitly or explicitly arguments), but in either case we need to see how, and how well, people estimate statistical information from charts.

So what did we find? How close are we to a visual statistics promised land? If I had to summarize it, I’d say it’s **when you ask visual questions, you get visual answers.** That is, the form of the visualization you present naturally influences the way that people extract information from it. This might sound obvious (and maybe it is), but I do think it puts paid to a particular theory of visual statistics as some sort of direct access to a statistical quantity. Rather than looking at a chart and saying: “from the arrangement of points, I can tell that the standard deviation must be approximately this value,” I think (but would be very excited to run some studies to investigate) that there’s an intermediary step where the visual patterns we pull out (say, cluster centroids or regions of high density) are mentally and perhaps even consciously converted to statistical quantities, with all the sorts of estimation and anchoring and adjustment inherent in such things.

As an example of what I mean, people were generally pretty good about estimating means, regardless of the visualizations we gave them. But they were especially good at this for box plots. And why shouldn’t they be? We were showing them samples from unimodal Gaussian data, and the box plot has a nice line for showing the median that just so happens, for unimodal Gaussian data, to be pretty close to the mean. So if your visual statistical strategy is “put the middle of the distribution where the median line of the box plot is”, you’d probably do pretty well. But you’d maybe screw things up for distributions (say, those with super long tails, or more than one mode) where the mean and the median are less tightly connected. And in those cases, the visual strategies you’d follow with dot plots or histograms might prove more reliable.
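To make the "median line as a stand-in for the mean" point concrete, here is a small sketch comparing the two on simulated data. The sample size, the seed, and the choice of a log-normal as the long-tailed case are my own illustrative assumptions; none of this is the stimulus data from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Unimodal Gaussian data: the setting where the box plot's median line works well.
gaussian = rng.normal(loc=0.0, scale=1.0, size=n)

# A long-tailed alternative (log-normal), chosen only for illustration.
long_tailed = rng.lognormal(mean=0.0, sigma=1.0, size=n)

def mean_median_gap(sample):
    """Absolute difference between the sample mean and the sample median."""
    return abs(np.mean(sample) - np.median(sample))

gap_gaussian = mean_median_gap(gaussian)
gap_long_tailed = mean_median_gap(long_tailed)

print(f"Gaussian:    |mean - median| = {gap_gaussian:.3f}")
print(f"Long-tailed: |mean - median| = {gap_long_tailed:.3f}")
```

For the Gaussian sample the gap is close to zero, so reading the mean off the median line costs you very little. For the long-tailed sample the same strategy lands noticeably away from the mean.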

But the result I thought was the most fun (so much so that it’s what I allude to in the title of this post) is what we are calling “the umbrella effect.” For means, people were pretty good. But for estimating spread, we found patterns of both habitual overestimation and underestimation, depending on the visualization we showed. There was a bit of underestimation in box plots, and relatively more overestimation in dot plots, strip plots, and histograms. Here’s what I mean: here are all the participant responses for the same data set in gray, centered around the true mean, and with the true distribution in red:

It’s here where “ask a visual question, get a visual answer” strikes again: we suspect (and I should emphasize here that this is, indeed, speculation) that people wanted to *fit all of the points “inside” the curve*. The curve needs to shelter everything, like an umbrella protecting all the points it contains. For box plots there’s less stuff that needs to fit, and the ends of the whiskers don’t capture all the action in the tails, so a little underestimation is to be expected. But for the dot plots, where there’s a bunch of points in the center, participants went pretty wide to fit as much as possible under their umbrellas.
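Here is one toy way to see how an "everything must fit under the curve" strategy could inflate spread estimates. This is purely a hypothetical model, not something we measured or fit in the paper: I am assuming a viewer who treats the visible body of the bell curve as spanning about two standard deviations on each side (the constant `COVER_K`), and who widens the curve until its body reaches the most extreme point.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

center = sample.mean()
sample_sd = sample.std(ddof=1)               # the usual sample standard deviation

# Assumption: the "umbrella" is the part of the curve within COVER_K standard
# deviations of the center. This value is a guess, not an empirical finding.
COVER_K = 2.0

# Hypothetical umbrella strategy: pick the spread so the umbrella just reaches
# the point farthest from the center.
farthest = np.max(np.abs(sample - center))
umbrella_sd = farthest / COVER_K

print(f"sample SD:   {sample_sd:.3f}")
print(f"umbrella SD: {umbrella_sd:.3f}")
```

Under this assumption the umbrella estimate comes out wider than the sample standard deviation, because a couple hundred Gaussian draws will almost always include points more than two standard deviations out. The direction of the bias depends entirely on `COVER_K`: a viewer who pictures the curve's body as three standard deviations wide would see little or no overestimation here.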

So what? Do these errors and biases dash the hopes of democratized, visual statistics-based understanding of data? I don’t think so, but it should hopefully dash one’s hopes that such a project will be easy, or will consist of just throwing the status quo of statistical graphics at the problem and calling it good. We might need crazier “xenographics” that more directly encode the statistics we don’t trust people to visually estimate in standard graphics. But I also think that it requires thinking not just of visualization as encoding *data* into pictures, but also how we can encode *questions* and *assumptions* into these same pictures: we might need to visualize *priors* with the same care that we visualize *results*. Or, we may need to make sure that the visualizations we show are better aligned with the statistical quantities and models of interest, and that visual conclusions match the statistical ones (I think confidence intervals are a particularly perverse case, where the visual overlap or non-overlap of the confidence intervals may or may not say anything definitive about the statistical significance of differences). There’s a lot of work ahead of us that moves from the world of graphical perception and time and error experimentation and into design and theory.
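For the confidence interval case, a quick worked example shows how the visual and statistical conclusions can come apart. The numbers are made up, and the sketch assumes two independent groups with known standard errors and a normal approximation (so a plain z-test rather than whatever test you might actually run).

```python
import math

Z95 = 1.959963984540054  # two-sided 95% critical value of the standard normal

def ci95(mean, se):
    """95% confidence interval under a normal approximation."""
    half_width = Z95 * se
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def two_sided_p(mean_a, se_a, mean_b, se_b):
    """Two-sided z-test p-value for a difference of independent means."""
    z = abs(mean_a - mean_b) / math.hypot(se_a, se_b)
    return math.erfc(z / math.sqrt(2))

# Made-up summary statistics for two groups.
mean_a, se_a = 0.0, 0.3
mean_b, se_b = 1.0, 0.3

ci_a = ci95(mean_a, se_a)
ci_b = ci95(mean_b, se_b)
overlap = intervals_overlap(ci_a, ci_b)
p_value = two_sided_p(mean_a, se_a, mean_b, se_b)

print(f"CI A: ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"CI B: ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print(f"intervals overlap: {overlap}, p-value for the difference: {p_value:.4f}")
```

Here the two 95% intervals overlap, yet the difference between the means is significant at the 0.05 level, because the standard error of the difference is smaller than the sum of the two interval half-widths. "The error bars overlap, so there's no difference" is exactly the kind of visual conclusion that doesn't line up with the statistical one.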