The Promises and Perils of “Automatic Data Visualization Recommendation”

Michael Correll
9 min read · Apr 27, 2022


A gladiator stands over a defeated foe. The bloodthirsty crowd is signaling “thumbs down.” I’ve been told that, contrary to popular belief, thumbs down was actually the “spare him” sign, but then I was told that this is also a myth, and so we’re at nested layers of wrong here. I’d advise just keeping your hands to yourself if you end up in ancient Rome, just in case.
“Pollice Verso” (Thumbs Down) by Jean-Léon Gérôme, 1872. Pretend the murmillos are battling pie charts or something, if that helps.

This blog post accompanies a paper to be presented at CHI 2022, “Recommendations for Visualization Recommendations: Exploring Preferences and Priorities in Public Health,” written by Calvin Bao, Siyao Li, Sarah Flores, Michael Correll, and Leilani Battle. For more details, read the paper: https://arxiv.org/abs/2202.01335!

The pitch is pretty simple: you want to see what’s going on in your data, but you don’t want to fiddle with designs, sift through all the uninteresting stuff to get to the good parts, or figure out how to best translate your pressing questions into statistical graphics. The solution is to let computers take care of all of that for you. When it works, it’s a nice division of labor: you get to quickly see a high-quality visualization about something you’re interested in without mucking around and hoping to run into something insightful serendipitously. It “augments and enriches” the very human process of trying to make sense of data. And this automation plays nicely with hot topics like “machine learning,” “artificial intelligence,” and “human-computer interaction” as well (even if most of the systems use relatively simple statistical models or rule-based recommendation engines to actually do the automating rather than complex ML/AI), so if you’re looking for research grants or venture capital funding it’s a nice space to play around in.

A diagram showing our three categories of visualization systems: encoding recommenders, Q&A recommenders, and auto-insight recommenders on a fictional dataset of animals and their properties (in this case, the number of legs). The source data table is juxtaposed with a bar chart.
Each of our three categories of visualization recommendation systems might recommend a bar chart for this fictional dataset of animal data. The encoding recommender because it’s the “right” visual encoding for categorical data with a quantitative measure, the Q&A recommender because you asked it about legs, and the auto-insight recommender because it found a potentially insightful statistical leg-related pattern to highlight. Also, please don’t email me about millipedes: the number of legs they have varies but it’s less than a thousand.

This drive for automation has created an ecosystem of “visualization recommendation systems” or “automatic visualization systems,” including (but not limited to) the following categories (there’s a rough code sketch of all three just after the list):

  • Encoding recommenders: recommend a visual design for a particular chart, given, say, the shape of your data, or particular design constraints. Ideally, some of your data come in, and a well-designed chart comes out.
  • Q&A recommenders: given a prompt (say, a natural language sentence), recommend a chart that answers the question raised by the prompt. Ideally, an arbitrary question about your data comes in, and a relevant and satisfying (visual) response comes out.
  • Auto-insight recommenders: recommend a set of patterns, summaries, or events likely to be of interest in your data. Ideally, all of your data come in, and a (small) set of insightful charts come out.
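
To make that division of labor concrete, here is a minimal, hypothetical sketch of the three “shapes” these systems take, using the toy animal data from the figure above. The function names and chart-spec dictionaries are mine, not any particular system’s API, and the hard-coded return values stand in for whatever rules, heuristics, or models a real recommender would use:

```python
# A minimal, hypothetical sketch of the three recommender "shapes."
# Function names and chart-spec dictionaries are illustrative,
# not any real system's API.

animals = [  # the toy dataset from the figure above
    {"animal": "dog", "legs": 4},
    {"animal": "hen", "legs": 2},
    {"animal": "ant", "legs": 6},
]

def encoding_recommender(data):
    """Data shape in, chart design out: one categorical field plus one
    quantitative field is the textbook case for a bar chart."""
    return {"mark": "bar", "x": "animal", "y": "legs"}

def qa_recommender(data, question):
    """Question in, answering chart out: a question about legs gets
    mapped onto a chart whose measure is legs."""
    return {"mark": "bar", "x": "animal", "y": "legs", "title": question}

def auto_insight_recommender(data):
    """All of the data in, a (small) ranked set of 'interesting' charts
    out, each annotated with whatever pattern the system surfaced."""
    return [{"mark": "bar", "x": "animal", "y": "legs",
             "note": "ants have three times as many legs as hens"}]
```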

It is this last category of auto-insights that I think deserves the most follow-up, since it a) is generating the most press and hype (especially for the people who think the job of visualization is “insights,” whatever that happens to mean) and b) is where the actual systems have so much potential for misuse or harm: an encoding recommender might fail by recommending an ugly chart, but an auto-insight system might end up causing all sorts of havoc as people make the wrong decisions based on weak evidence.

What Can Go Wrong With Recommendations?

I am skeptical of some of the thinking and many of the artifacts coming out of this new wave of automatic visualizations, and, since “academic writing at unclear levels of ironic detachment” is apparently the main hammer I have these days for the various socio-technical nails I see around me, I cowrote a quasi-satirical paper about it, where we compared these systems unfavorably to other techniques that purport to give you insight, such as the casting of bones or the drawing of tarot cards. Our general worries, especially about auto-insight systems, fell into several buckets:

Opaque

Because the recommendations are often driven by complex machine learning algorithms or statistical models, I often have no idea what’s happening under the hood of the recommender system, and so I have no idea whether to trust what I’m seeing. Or, worse, I’ll blindly trust the output when I really shouldn’t.

Inflexible

Recommendation systems often expect a very specific input, and generate very specific output. There’s a fixed language (e.g., of what counts as “insightful”) set prior to any exposure to data. If we don’t play along with these preconditions, or speak in the right language, then we will receive broken or nonsensical results back. Combined with opacity issues like those above, we might not even know what we’re missing out on.

Domineering

In exploratory data analysis, I have immense freedom to set my own priorities or dive deeply into my personal questions. In a recommendation-based world we often just get whatever the system gives us, which usually means having to accept the highly opinionated priorities of whoever built the recommendation engine.

Brittle

This concern is slightly different from inflexibility above, and deals more with the systems breaking once they leave the land of demos or predicting yesterday’s weather and are asked to operate in the real world. Especially for auto-insight systems, the search for particularly interesting patterns may have very little to do with the search for reliable patterns. I still think that a lot of auto-insight systems are just “p-hacking machines” with a nice coat of paint.
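
As a toy illustration of that worry (my own sketch, not a model of any particular product): if an “insight” search simply scans every pair of columns for a “significant” correlation, with no correction for multiple comparisons, it will cheerfully report patterns in data that is pure noise.

```python
# Toy illustration of the "p-hacking machine" worry: scan every pair of
# columns in pure-noise data for a "significant" correlation and count
# how many spurious "insights" an exhaustive search turns up.
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 30))  # 200 rows, 30 columns of pure noise

insights = []
for i, j in combinations(range(noise.shape[1]), 2):
    r, p = pearsonr(noise[:, i], noise[:, j])
    if p < 0.05:  # the classic uncorrected threshold
        insights.append((i, j, r, p))

# Expect roughly 5% of the 435 tested pairs to clear the bar by chance.
print(f"{len(insights)} 'insights' found in data with no real structure")
```

With 435 pairs tested at an uncorrected 0.05 threshold, roughly twenty of them will clear the bar by chance alone, and that is before the system picks the most visually dramatic ones to show you.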

Exploring Preferences and Values

A gallery of twelve participant sketches of visualizations of the NHANES survey data, showing a variety of visual designs and content, from pie charts to line charts to box plots to stacked bars.
A gallery of some of the (digital) sketches generated by our participants, used by a pair of mediators to create final visualizations. Our participants used a variety of visualization designs to communicate an equally diverse range of information about nutritional health.

But all of this skepticism was not particularly evidence-based. I am anxious about auto-insight technologies, sure, but I’m also anxious about whether or not I left the oven on whenever I go on trips, and it’s at the very least a bit rude to make my neuroses everybody else’s problem all the time. To me there’s an empirical question here about whether these potential hazards are as dangerous as I think they are, or whether the mismatches I see between “manual” and “automatic” visual analytics are as friction-inducing as I suspect. Or, to be even more negative, whether the issues I am describing above are not unique to “automated” analytics at all, but are just as deeply problematic for the ways people do analytics even without algorithmic help.

This is (finally!) where the paper we’re presenting at CHI 2022 comes in. We recruited students and researchers of public health, gave them a well-known dataset (in this case the CDC’s National Health and Nutrition Examination Survey, or NHANES), and asked them both to rate the “insights” generated by state-of-the-art systems and, with the help of a pair of us acting as mediators, to perform the role of “insight recommenders” themselves. The goal here was to get some sense of their preferences and values: what did they like and not like in the purportedly insightful charts they were shown? What priorities would they insist on when it came time to generate their own charts that were intended to be insightful to others?

What we found was a mixture of good news and bad news for anybody hoping to create their own recommendation systems, and of course each of our participants brought their own opinions and priorities to the process. Nevertheless, we did pick out three major themes that seemed the most relevant:

Simplicity

Our participants valued simplicity. Simplicity can mean a lot of things, but here it ran the gamut from simplicity in the choice of visualizations (people were very unlikely to recommend anything too weird: bar charts, pie charts, and scatterplots were the usual orders) to simplicity in the presented data (people would filter out data to present cleaner or less chaotic charts, and use text and titles to direct the intended viewer to the precise takeaways they needed). This desire for simplicity came even at the cost of accuracy or nuance: it doesn’t matter how clever a job your chart does at showing the cool thing in the data if nobody can understand it.

I think it’s also important that simplicity was often pitched in terms of some absent other: a stakeholder or a client or a mass public. The idea was to simplify, simplify, simplify so the intended audience wasn’t confused about what to do next, bored, or otherwise uncomprehending (although note that I think we are often unwilling to think about ways to better surface complexity in visualizations or otherwise meet our audiences halfway).

Relevance

There were lots of strong or significant signals in our dataset, and lots of ways that you could potentially visualize those signals. Yet our participants did not just evenly or randomly sample among fields in the dataset: they were often looking to illustrate specific hypotheses or relationships that they suspected people might care about.

While the selection of particular fields was part of this filtering process, some of this relevance filtering also showed up in repeated choices of chart design. For instance, while there’s a lot of information about people in the data set (say, how much water or caffeine or sugar or alcohol they self-reported as consuming), not all of that information was about health outcomes (say, diabetes status or self-reported feelings of depression). And a very common chart design (by some counts, even the modal design) was specifically to see how demographic information relates to health outcomes (say, whether people who consumed more alcohol were more likely to have diabetes). In other words, not just any combination of two fields would make for a good recommendation, even if it had a strong statistical “signal” or other feature that would get it picked up by an auto-insight system: it had to tell a relevant story that our participants could rationalize and reason about.
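
One way to read this is as a constraint that a recommender could, at least in principle, respect: rather than ranking every pair of fields by the strength of its statistical signal, restrict the candidate charts to pairings that connect a consumption or demographic field to a health outcome. Here is a hedged sketch of that idea; the field names are illustrative stand-ins, not actual NHANES column names.

```python
# Hedged sketch of the "relevance" filter our participants applied
# implicitly: only pair consumption/demographic fields with health
# outcomes, rather than ranking every possible pair of fields.
# Field names are illustrative, not actual NHANES column names.
from itertools import product

exposures = ["alcohol_consumption", "caffeine_intake", "sugar_intake"]
outcomes = ["diabetes_status", "depression_score"]

# An unconstrained auto-insight system would consider every pair of
# fields; a relevance-aware one narrows the search to pairings that
# tell a story a public-health audience can reason about.
candidate_charts = [
    {"mark": "bar", "x": exposure, "y": outcome}
    for exposure, outcome in product(exposures, outcomes)
]
```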

Interestingness

The last value is both the most vague and the most alarming: the desire for the resulting charts to be of immediate and clear interest. Even if a chart was about relevant variables, if it didn’t show a strong trend, outlier, or other distinct visual or statistical pattern, our participants were quick to discard or down-weight it. Only one participant had anything remotely positive to say about charts of negative results:

It doesn’t show me anything, should I change it? But I mean, that’s sometimes good to know to see that there isn’t really a trend between two things.

That was it, across hours of statements and dozens and dozens of charts. Otherwise, the preference was always for clear take-aways or trends or patterns (and ideally just one take-away per chart, in line with simplicity).

Wrap Up

To me, this work suggests that we need to seriously rethink the goal of an “auto-insight system” as some totally automated process where data goes in and insightful charts come out: the human analyst and the analytical process bring a lot of nuance, background, and opinion to the table that a one-shot, fully automated process is unlikely to replicate. Rather, while I acknowledge that automating away some of the tedium of finding things in your data would be useful, I think this points to the power and necessity of conversational and iterative analytical tools: the very same sorts of potentially laborious hypothesis formation and exploration that automated systems promised to do away with. Even if it’s just to “prime the pump” and inject some domain knowledge or constraints into the system, or to provide feedback or corrections after the fact, making these sorts of insights legible and interesting requires at least a dialogue.

But lest you think that human beings are off the hook, this work to me also points to some challenges for the fully “manual” exploratory data analytics process. For one thing, people seem to like simple (but maybe incorrect or insufficiently nuanced) stories rather than more complex ones: getting people to accept and embrace (or even just understand) complexity or uncertainty in their data is difficult. The other is the very human bias to want to see something interesting instead of nothing, even if really there’s nothing (and certainly nothing statistically rigorous) to see in the data set. Our instincts for apophenia are pretty strong: we want to see patterns even when there are none. To me this suggests that there is a design challenge around not just auto-insight systems but also auto-non-insight systems: how do we convince users that their data isn’t sufficient to draw the sorts of conclusions they might otherwise want to draw?
