Exploratory Data Analysis is Mostly Just Vibes, Man

Michael Correll
12 min read · Mar 13, 2024


Painting of a fisherman holding oars in a boat laden with fish, looking out into an ominous sky.
“The Fog Warning” by Winslow Homer, 1885.

This blog post accompanies a paper to be presented at CHI 2024, “Odds and Insights: Decision Quality in Visual Analytics” written by Abhraneel Sarma, Xiaoying Pu, Yuan Cui, Eli T. Brown, Michael Correll, and Matthew Kay. For more details, read the paper.

There is a visualization paper that has never been far from my mind for almost exactly six years at this point. That paper is Zgraggen et al.’s “Investigating the Multiple Comparisons Problem in Visual Analysis.” I wish I could find all of my notes from the time (there were several pages at one point: some of them got folded into this blog post) but I remember thinking something like “well, hmmm. I guess the jig is up.”

Part of that response is actually anxiety about an only partially related issue. So here’s a brief(ish) detour, in the form of a slide from my usual lecture on uncertainty visualization:

A chart titled “which stock to buy” showing bar charts of stock prices for company A and company B over the same time periods. Company A has both the highest and the lowest values.

What I have students do is look at these charts, nominally of stock prices over time, and tell me which is the better buy. There’s some hesitation and a few requests for clarification (axis labels and other such niceties), but usually after a while I get something like “well, Company A is more volatile but seems to be hitting higher highs, whereas B has lower but steadier prices.” By this point they usually suspect something is up, though, and so it’s often not much of a surprise when I reveal in the next slide that the data for both charts were generated just by calling Excel’s RAND function a bunch of times, and that any purported pattern is an illusory artifact of the luck of the draw.
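
If you want to run this demo yourself outside of Excel, here’s a minimal sketch of the same trick in Python (the number of time periods is arbitrary, and the labels are just illustrative):

```python
import numpy as np

rng = np.random.default_rng()  # deliberately unseeded: every run yields a fresh "pattern"

# The same trick as filling cells with =RAND() in Excel: two "companies," pure noise.
periods = 12
company_a = rng.random(periods)
company_b = rng.random(periods)

# Any "volatility" or "higher highs" you read off of these is an artifact of the draw.
print("Company A:", np.round(company_a, 2))
print("Company B:", np.round(company_b, 2))
```

Chart the output as bars and you’ll find a story to tell about it just about every time.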

There are a few things going on with this example:

  1. Apophenia (and/or Pareidolia): we are very good at seeing patterns in data, even when there is nothing to see. Pattern recognition is just a really human thing to do.
  2. Demand characteristics: By asking a question and waiting until I got answers, I was implicitly creating an environment where people felt expected to give me something, even a pattern they weren’t too sure about.
  3. The “Lure of Incredible Certitude”: I was giving people a task (investing) that by necessity involves risk and uncertainty, but I wasn’t really couching the problem as one that involves uncertainty, or presenting them with data that included the relevant uncertainty information.

But these three points, together, contribute to what I think is a pretty strong critique of exploratory data analysis (EDA) as it is sometimes envisioned: a process of digging into a dataset and reporting out whenever you find interesting patterns. Namely, that those resulting insights will be mostly horseshit because you went on a fishing expedition for random stuff, were incentivized to report whatever you found, and, because you were likely just generating bar charts and scatterplots and so on, didn’t have the usual tools (like inferential statistics or modeling or what have you) to separate the metaphorical wheat from the chaff. And if that’s the case, what the hell are we even doing?

Okay, with that out of the way, we can go up one level of recursion to talk about the Zgraggen et al. paper (when we’re done with that, remember that we still have one level more to go, where we talk about the actual paper we just put out). The Zgraggen et al. paper’s critique of EDA has similar antecedents to mine, but is anchored more strongly on the multiple comparisons problem (MCP). The general idea is that:

  1. In the process of EDA, you look at lots of charts. These charts (may!) function as hypothesis tests, either implicitly or explicitly: an insight might be, e.g., that two groups are significantly different, or that two variables are highly correlated, which are questions that connect to testable statistical hypotheses.
  2. Even if there is no difference in populations, by sheer chance and sampling error, some of the samples from those populations will pass the standards of our hypothesis test, whether visual, statistical, or visual-statistical. That is, there will appear to be an insightful pattern in the samples, even though there is no “true” pattern in the population.
  3. As the number of charts, and so the number of comparisons, increases, the likelihood of encountering such an apparent but ultimately false difference, and so of making this kind of false positive error, gets larger and larger (a quick simulation of this effect follows the list).
  4. People will be unable to disambiguate these apparent differences caused by noise and sampling error from “true” insights, and so outputs from an EDA session are highly likely to contain false insights.
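
Here is that simulation: a minimal sketch of my own (not from the paper), assuming independent comparisons and the usual α = 0.05 threshold. Every “chart” compares two samples drawn from identical populations, so any “insight” is by construction a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_per_group, n_sessions = 0.05, 30, 500

# How often does an analysis session contain at least one spurious "insight,"
# as a function of how many charts (comparisons) the analyst looks at?
for m in [1, 5, 20, 50]:
    sessions_with_false_insight = 0
    for _ in range(n_sessions):
        p_values = [stats.ttest_ind(rng.normal(size=n_per_group),
                                    rng.normal(size=n_per_group)).pvalue
                    for _ in range(m)]
        sessions_with_false_insight += any(p < alpha for p in p_values)
    print(f"{m:>2} charts: ~{sessions_with_false_insight / n_sessions:.0%} of sessions "
          f"contain a false 'insight' (analytically, 1 - (1 - alpha)**m = {1 - (1 - alpha)**m:.0%})")
```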

For instance, here’s some data on unpopular pizza toppings in the United States:

A chart titled “Anchovy Liking States” showing the percentage of U.S. states that like anchovies on pizza, with a reference line and band at an average of a little above 50%. Maryland is the highest percentage state (over 80%), and Iowa the lowest (under 30%).

In line with the stock example, you might look at this and say “wow, Maryland really likes anchovies more than the average, I wonder if that’s because they are close to the sea or if they just consume more seafood in general, and that explains why landlocked states like Iowa seem to hate anchovies.” Or you might say something like that if you hadn’t read the preceding paragraphs, because what you are likely thinking is “I bet this is just random noise.” And you would be right, dear reader, because this data is just me drawing numbers from a Gaussian distribution with a mean of 50% and standard deviation of 10%, sorting them, and calling it a day. There’s no actual signal here, as we can see when I generate the data in the exact same way, but this time claim it’s about pineapple on pizza:

A chart titled “Pineapple Liking States” showing the percentage of U.S. states that like pineapple on pizza, with a reference line and band at an average of a little above 50%. Georgia is the highest percentage state (over 70%), and Oregon the lowest (under 40%).

Here, Maryland and Iowa are now closer to the middle of the pack. You can think of it kind of like luck: I flipped a coin 50 times (once per state) knowing that at least a few flips would come up “lucky” (in this case, anomalously high), pretended that was an insight, and then let my post hoc rationalizations and theorizing kick in, coming up with explanations for an (apparent) finding that is, by construction, just statistical noise.
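
If you want to check that there’s nothing up my sleeve, here’s roughly how you could generate this kind of data yourself (a minimal sketch; I’m only using a handful of state names for illustration):

```python
import numpy as np

rng = np.random.default_rng()
# A handful of labels standing in for all 50 states.
states = np.array(["Maryland", "Iowa", "Georgia", "Oregon", "Ohio", "Maine", "Utah", "Texas"])

def fake_topping_poll():
    # Exactly the construction described above: Gaussian noise, mean 50%, sd 10%, one value per "state".
    return rng.normal(loc=50, scale=10, size=len(states))

anchovies = fake_topping_poll()
pineapple = fake_topping_poll()

# Some state always tops the list; which one is pure luck of the draw, so any story
# about coastlines or seafood consumption is a post hoc rationalization of noise.
print("Loves anchovies the most:", states[np.argmax(anchovies)], round(anchovies.max(), 1), "%")
print("Loves pineapple the most:", states[np.argmax(pineapple)], round(pineapple.max(), 1), "%")
```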

The Zgraggen et al. paper conducted an experiment of roughly this form (give people a dataset that is mostly Gaussian noise, but with a few genuine correlations in it), let people go wild generating the graphs they wanted to see and recording the subsequent insights, and then tallied how often people got tricked by the noise rather than the signal. This gets you the very scary lede in their abstract: “In our experiment, over 60% of user insights were false.” The MCP-based rationale would be that users should really be performing some sort of correction to account for multiple comparisons (the equivalent of turning some little knob labeled “insight conservatism” up as the number of comparisons they made increased) and, well, their subjects didn’t seem to.

Our paper (we’re almost done with the big digressions now, I promise, so you can relax whatever mental equivalent of a stack data structure you have in your head) was an attempt to dig a little bit more into this question of how doomed EDA is, given the MCP, mainly via two counterarguments:

  1. We are giving people the wrong tools for the job: if I’m asking you questions about uncertain information, it’s at least a little bit rude that I’m not giving you visualizations that communicate uncertainty in any direct way. Kind of like giving people peanut butter, jelly, bread, and a knife, and then acting surprised that they made a PB&J sandwich and not filet mignon.
  2. We are giving people the wrong incentive structures: I could be wrong but I am speculating that you, the reader, do not give a single solitary shit about fictional pizza topping preferences or fictional stock performance because it is, well, fictional. So your quality assurance standards for reporting out on the data on those topics are probably pretty low. But if you were, say, the person in charge of deciding when to shut down a nuclear reactor, you might do a bit more due diligence than just looking at the first bar chart you could find.

So that’s (finally) what this paper was about: if you give people uncertainty information, and incentivize them to avoid false positives, do they still fall for errors that arise from the MCP?

It sounds straightforward, but it’s not, really, because the notion of a null hypothesis significance test is just a totally alien thing that has very little relationship to how people make decisions and reason about data. For instance, that whole p = 0.05 thing that seems really, really important? That’s just because a random eugenics weirdo said that 1/20 was “convenient” as a threshold in 1925. So the proposed statistical corrections to account for the MCP (stuff like the Bonferroni correction, where you divide that 1/20 by the number of total comparisons you’re going to make) are even more alien to the EDA experience. Like, imagine you opened an analytics tool, looked around at some charts, and then the program immediately crashed to desktop because it had determined that you had looked at too many charts today and was quitting to save you from potential false positives.
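
In code form, the Bonferroni correction really is just that one division (a minimal sketch; the comparison counts are arbitrary):

```python
# The Bonferroni correction in its entirety: the more charts you plan to look at,
# the stricter the evidence bar becomes for any single one of them.
alpha = 0.05  # Fisher's "convenient" 1/20
for m in [1, 5, 20, 100]:
    print(f"{m:>3} comparisons -> flag a chart only if p < {alpha / m:.4f}")
```

Which is exactly the crash-to-desktop absurdity above: the threshold depends on how many charts you were ever going to look at, a number an EDA tool has no way of knowing.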

And that’s even assuming that people are dealing with the case where we have samples from an unknown population, which is the frequentist framing where p-values start making sense. EDA is not always about inferences based on samples, or at least not in a straightforward conceptual way! Like, if your company made $2 million last year and $3 million this year, the idea that you have to test whether this was a “statistically significant” difference is just not particularly interesting, because an extra million bucks spends the same whether it’s statistically significant or not (although note that stuff like hiring or firing CEOs based on earnings performance does get into MCP territory, and runs afoul of many of the issues discussed above). So a big intellectual challenge of this work was, in my opinion, creating an incentive structure that matches anything close to the kind of scenario where the MCP would kick in, or where people would be incentivized to be as conservative about false positives as would be warranted in such a scenario.

We thought it was highly unlikely that people would be “automatic t-test machines,” but I did think that people would have kind of “t-test vibes” when making decisions, in the right circumstances. That is: big difference in means, small variability? That seems like a good bet. Smaller difference in means, or bigger variability? Maybe less of a good bet. And, once you are in a position where you start seeing lots and lots of these comparisons, and have incentive structures to avoid false positives, maybe you either start out more conservative in your guesses or become so as you adapt to the incentive structure.
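
To make those “vibes” concrete, here’s a toy illustration of the signal-to-noise intuition (the numbers are invented, and this is not the experimental stimulus, just the flavor of the judgment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two made-up "regions": a big mean with small spread vs. a small mean with big spread.
steady_winner = rng.normal(loc=5.0, scale=2.0, size=20)   # feels like a good bet
noisy_maybe = rng.normal(loc=1.0, scale=10.0, size=20)    # feels like less of a good bet

for name, sample in [("steady winner", steady_winner), ("noisy maybe", noisy_maybe)]:
    t, p = stats.ttest_1samp(sample, popmean=0.0)  # "is this region profitable at all?"
    print(f"{name}: mean={sample.mean():.1f}, sd={sample.std(ddof=1):.1f}, t={t:.1f}, p={p:.3f}")
```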

So that was the experimental task here: sort of beat an EDA-style problem (rather artificially) into a frequentist shape. In this case, you have samples of sales data from stores in a particular region, but don’t know the full population data for each region, and need to predict per-region population profitability. We then give people enough comparisons (different sets of samples from different regions) that MCP issues start kicking in, give them an incentive structure that rewards the kind of conservatism we want (so, e.g., in this case, you lose 150 points if you make a false positive guess, but only get 50 for a true positive), give them some uncertainty visualization rather than just the sample mean, and see what people do. Here’s an example stimulus, in this case from our “CI” (confidence interval) visualization condition:

Depiction of 8 similar graphs of averages with confidence intervals showing sample profit. 6 of the 8 charts show a positive average profit, but 4 of those 6 have confidence intervals that cross 0.

I would treat this one pretty much like the pizza topping examples I showed you before, with two distinctions. The first is that I’ve only got 8 “states” instead of 50; the second is that there are some actual patterns hidden here in the sea of statistical noise, the equivalent of a state or two that really does have a strong opinion for or against pineapple on pizza or what have you. Finding those is going to be a mix of looking at the robustness of patterns (differences in means and variability) and a non-trivial amount of pure luck: not all of the regions with positive sample profits were profitable as a population, and sampling error means even very high sample means have a non-zero probability of being false positives. No decision criterion is going to be perfect based only on the information provided, just better or worse, more conservative or less so.
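
One way to see how that incentive structure pushes toward conservatism is a quick expected-value calculation. This is my own back-of-the-envelope gloss, and it assumes that passing on a region scores zero points, which the description above doesn’t spell out:

```python
# Payoffs from the incentive structure described above.
reward_true_positive = 50
penalty_false_positive = 150

# Flagging a region as profitable has positive expected value only when
# p * 50 - (1 - p) * 150 > 0, i.e. p > 150 / (150 + 50) = 0.75.
break_even = penalty_false_positive / (penalty_false_positive + reward_true_positive)
print(f"Only worth flagging a region if you think P(actually profitable) > {break_even:.0%}")
```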

I do want to emphasize once again how many boxes you have to tick before it’s really a “fair” problem where we start caring about stuff like the MCP. If I were just asking about the profitability of the sampled stores themselves, then there’s not really any such thing as a false positive or false negative unless the viewer is misreading the chart. It’s only when we go into inference land that we’d start running into trouble (and that can, in fact, happen in EDA: I describe such a scenario in an earlier blog post).

So, with the preliminaries (finally) out of the way, what did we find? Well, lots of things, but here’s a big one as far as I’m concerned, comparing three ways of presenting the information: as a mean plus confidence interval (CI), as a density curve (PDF), or just giving people the sample data without any sort of aggregation or modeling and relying on visual estimation to do things like infer sample means or variability (baseline):

Chart showing the probability of false positives as depicted by post hoc probability density functions. The blue PDFs for the confidence interval and probability density function conditions are all below (better than) the red lines indicating our uncorrected strategy, but above (worse than) the lines for the optimal Benjamini-Hochberg strategy. The baseline condition is worse than all of these.

The red line in the chart above is what we called the “uncorrected strategy” in the paper, but which I think of as the “t-test golem” in line with my thinking above. That would be the decision quality if you just ignored the MCP, looked at the mean and error, and gave a thumbs up to every visualization where p < 0.05. So, for our data, this would result in the golem making errors on around 1 in 5 positive judgments. And then that blueish line on the left is what you’d get using the laser-guided, state-of-the-art statistical procedure for accounting for the MCP, the Benjamini-Hochberg correction, where your t-test golem first does a bit of estimating and sorting and line-fitting before setting a decision threshold.
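
For readers who haven’t run into it, here’s a generic textbook sketch of the Benjamini-Hochberg procedure (my own gloss, not the paper’s implementation): sort the p-values, find the largest rank whose p-value falls under a rising threshold, and flag everything up to that rank.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a per-comparison 'flag it' decision, controlling the false
    discovery rate at level q (a standard textbook version of the procedure)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest rank k (1-indexed) such that p_(k) <= (k / m) * q.
    under_threshold = p[order] <= (np.arange(1, m + 1) / m) * q
    flagged = np.zeros(m, dtype=bool)
    if under_threshold.any():
        k = np.max(np.where(under_threshold)[0])
        flagged[order[:k + 1]] = True  # flag everything at or below that rank
    return flagged

# Eight "regions": a couple of genuinely small p-values amid noise.
print(benjamini_hochberg([0.001, 0.30, 0.04, 0.65, 0.012, 0.80, 0.21, 0.049]))
# -> flags only the first and fifth comparisons at q = 0.05
```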

The human performance, then, is not very golem-like. But we do find that, when you give people uncertainty information, participants can better react to the MCP. They don’t necessarily react as much as they should, and they aren’t consistently hitting the theoretical limits of statistical MCP correction methods, but they are doing something. What it is, I can’t say exactly. I don’t think it’s explicitly golem-y stuff like some sort of Bonferroni correction where they have a mental p-value in their head that they divide based on the number of comparisons, but they weren’t helpless here. The exception is the result that if you don’t give people uncertainty information, participants struggle to react to the MCP. They do worse than our poor strawman (clayman?) uncorrected t-test golem! This is perhaps disappointing if we were hoping we could get away with making EDA systems that just show bars and scatterplots without any uncertainty information and trusting that things will just work out, but maybe not surprising for those holding the idea that if you give people tasks involving uncertainty, maybe, just maybe, you should directly give them uncertainty information as well.

There’s much more in the paper, and a deeper analysis of patterns of performance and error, but I did want to end with a few thoughts. The first is that I think this is, overall, a bit of a mixed bag for EDA in a “post-Zgraggen et al.” world (I have to come up with a pithier name for this concept). Okay, people weren’t making errors 60% of the time, like in the Zgraggen et al. work. But, on average, even for people given uncertainty information, it was still more like a 30–40% rate of false discoveries (both false positives and false negatives), and that’s not exactly a number that makes you fall back into dogmatic slumber (although note that we were setting the rate of true positives here to around 30% to try to optimize for detecting differences in strategy, which has a lot to do with the floors and ceilings of that number). And the second is that there just seems to be an affordance mismatch between the types of visualizations we see in traditional EDA and statistical graphics: they really do seem to be different looks for different goals. The way we conceptualize EDA, as poking around until we find something, is so alien to the sort of ways we think about inferential statistics and experimental design and so on that we perhaps shouldn’t be surprised that they behave differently. That might even be okay, some of the time: lots of things that help us make data-driven inferences about the world aren’t a lot like p-values or null hypothesis significance testing, and that might be a problem with statistics more than it’s a problem with people.
