In a perfect world, we wouldn’t have to think much about how our visualizations are made. We’d press a button, drag a few fields around, and get a nice shiny bar chart showing the data we want to see. However, we don’t live in a perfect world. Often our data are noisy, incomplete, irregular, or otherwise dirty. Before we make too many conclusions from that nice bar chart, we need to perform sanity checks on our data to make sure that nothing is wrong. These sanity checks are part of the process of data prep and cleaning that forms a good chunk of the data scientist’s time.
In these sorts of sanity checks, we typically rely on a combination of summary statistics (means, standard deviations, ranges, etc.) and simple visualizations (histograms, dot plots, box-and-whisker plots, etc.). Simple summary stats alone may not catch patterns that indicate that something is wrong (just look at the weird patterns that don’t show up in the stats of Autodesk’s Datasaurus Dozen). Therefore, we usually rely on visualizations like histograms or dot plots to pick out the data issues that our simple summary stats missed. The assumption is that through visual inspection, we can catch the data quality issues that we couldn’t detect with our stats. But does this assumption hold? How reliable are visualizations as sanity checks?
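To make the failure mode concrete, here is a minimal sketch (my own illustration, not from the paper) of data whose summary stats look clean while the distribution itself has a hole in it. The gap threshold of 0.25 is an arbitrary choice for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# "Clean" sample: plain draws from a standard normal.
clean = rng.normal(loc=0.0, scale=1.0, size=10_000)

# "Dirty" sample: same normal, but with a region of missing values
# carved out of the middle, then re-standardized so its mean and
# standard deviation match the clean sample again.
dirty = rng.normal(loc=0.0, scale=1.0, size=20_000)
dirty = dirty[np.abs(dirty) > 0.25][:10_000]   # drop everything near zero
dirty = (dirty - dirty.mean()) / dirty.std()   # mean ≈ 0, std ≈ 1 again

# The summary stats of both samples are ≈ (0.0, 1.0) — indistinguishable.
print(clean.mean(), clean.std())
print(dirty.mean(), dirty.std())

# But a histogram exposes the flaw: bins that lie entirely inside the
# gap contain no data at all.
counts, edges = np.histogram(dirty, bins=50, range=(-4, 4))
gap_bins = counts[(edges[:-1] >= -0.25) & (edges[1:] <= 0.25)]
print(gap_bins.sum())  # → 0
```

The stats pass the sanity check; only a view of the distribution's shape catches the missing region.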
We’re in big trouble if most data quality issues are not readily visible. In particular, it means that we shouldn’t trust visualizations that we see during the course of an analysis session unless they’ve been thoroughly vetted by some third party (for instance, statistical anomaly detection procedures that we know work well on this kind of data, or manual inspection of the raw data tables). This work is an attempt to see just how reliable and robust these visual signals of data quality are, and how much of this vetting or manual inspection we need to bake into our analysis tools or processes.
We investigated this question from two directions. The first was adversarial: if I’m a nefarious designer of visualizations, can I produce a visualization that faithfully encodes the data, but totally hides the fact that there are big data quality issues that need to be addressed? The second was empirical: how reliably can people pick out flaws in their data given common types of summary visualizations?
First off, we created what I call the “Adversarial Visualization Creation Kit” (if you’ve got a better name with a better acronym, I’m all ears). This was an attempt to see just how bad this problem can be. The system generates a sample of points, and then you add any of a set of data quality issues to this original sample (including things like big spiky modes, outliers, or just per-item random noise). The system then searches for the parameter settings (the number of bins in the histogram, the size and transparency of points in a dot plot, and how much smoothing to apply when generating a density plot) that make the “clean” and “dirty” data look as similar as possible, as measured by the pixel color difference between the two visualizations. This is an attempt to create a pair of visualizations that look nearly identical, but hide potentially conclusion-altering flaws.
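The core idea can be sketched in a few lines. This is not the actual system — just a toy version of the adversarial search, using the total difference in normalized bar heights as a stand-in for pixel difference, and searching only over histogram bin counts:

```python
import numpy as np

rng = np.random.default_rng(7)
clean = rng.normal(0.0, 1.0, 2_000)

# Inject a data quality flaw: a small spiky mode of repeated values.
dirty = np.concatenate([clean[:-60], np.full(60, 1.5)])

def binned_difference(a, b, bins):
    """Crude proxy for pixel difference between two histograms:
    total absolute difference in normalized bar heights."""
    ha, edges = np.histogram(a, bins=bins, range=(-4, 4), density=True)
    hb, _ = np.histogram(b, bins=edges, density=True)
    return np.abs(ha - hb).sum()

# Adversarial search: which bin count makes the clean and dirty
# samples look most alike?
candidates = range(2, 51)
scores = {b: binned_difference(clean, dirty, b) for b in candidates}
best = min(scores, key=scores.get)
print(best, scores[best])  # coarse binnings tend to hide the spike
```

The real system optimizes over several chart types and design parameters at once and compares rendered pixels, but the adversarial logic is the same: pick the design settings that minimize the visible evidence of the flaw.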
But of course people aren’t professional pixel comparers; the way that we compare visualizations and identify potential anomalies has very little to do with average pixel changes. So we performed an experiment using a lineup protocol where our participants had to identify the one guilty visualization among a sea of innocent ones. Here’s an example: one of these 20 dot plots has a region of missing values somewhere in the middle, while the other 19 are just innocent random samples from a normal distribution:
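Generating the stimuli for such a lineup is straightforward. Here is a minimal sketch (my own, with an arbitrary gap location) of building a 20-panel lineup where one panel has a region of missing values and the other 19 are innocent normal samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def innocent(n=200):
    """An innocent panel: a plain random normal sample."""
    return rng.normal(0.0, 1.0, n)

def guilty(n=200):
    """The guilty panel: a normal sample with a region of missing
    values carved out of the middle."""
    x = rng.normal(0.0, 1.0, 3 * n)
    x = x[(x < 0.1) | (x > 0.6)]  # carve out the gap
    return x[:n]

# 19 innocent samples plus one guilty one, shuffled into a random
# position; viewers must pick out the odd panel by eye.
panels = [innocent() for _ in range(19)] + [guilty()]
order = rng.permutation(20)
lineup = [panels[i] for i in order]
guilty_position = int(np.where(order == 19)[0][0])
print(guilty_position)
```

If viewers reliably beat the 1-in-20 chance rate of guessing, the flaw is visually detectable; if they hover near 5%, the chart design is hiding it.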
What we found is a little unsettling: even with “reasonable” design parameters (such as the defaults in popular analysis tools like R and Vega), it’s possible to hide significant data quality issues like missing data, repeated data, and outliers in our standard charts. Under particularly bad settings, people were only a little better than chance (for instance, finding missing data in histograms with small numbers of bins).
Our results suggest that we need to be both careful in our own analysis and skeptical of the results we glean from visualizations where we haven’t done our due diligence on the underlying data. Potentially, we may need to bring in automated anomaly detection methods from statistics before we proceed with our analysis. Another solution is to rely on multiple charts and interaction to make sure that we’ve explored more of the parameter space than just the naïve default. In either event, a short glance at a histogram may not be enough to tell the full story of our data.
This article was written by Michael Correll, describing a paper we will present at IEEE VIS 2018, co-authored by Mingwei Li, Gordon Kindlmann, and Carlos Scheidegger. For more information, read the paper. The guilty visualization in the teaser image is in the first column, last row. The guilty visualization in the main article is in the last column, second to last row.