Hell is Other Reviewers
You can have the other words — chance, luck, coincidence,
serendipity. I'll take grace. I don't know what it is exactly, but
I'll take it.
–Mary Oliver, “Sand Dabs, Five”
I have become increasingly uncomfortable reviewing papers in my field recently. And not just because about a third of the papers I'm seeing these days are about LLMs, or bolting LLMs onto things that don't need LLMs, or trying to make interfaces for LLMs, and I have run out of ways to be constructively critical about LLMs. It's more that I see myself falling back into bad habits, failing to be either kind or useful (and you'd hope I'd be at least one, and ideally both, right?) to the people whose work I review, or even to myself. You know how sometimes (and I am really hoping this is a universal experience here) you have a dissociative experience during a party or conversation, where you can tell "I am talking too much" or "I am making a fool of myself" but still seem to be a passive third-party observer as the rest of you keeps nattering away? A number of times I have watched myself, in just that way, write the exact kinds of reviews that I would find annoying or useless if I received them on my own work, while feeling helpless to do anything else. Both in my own reviewing, and in the reviews I receive back, I keep thinking about the need for grace.
Before I begin unpacking that, I would like you to indulge me in a little guided visualization (in the ordinary mental-picture sense, not the data visualization sense). I want you to imagine that you have submitted a paper. It is not a perfect paper (nothing is); it's not likely to win best paper anywhere or turn the field on its head, but you think it has value, even if just as the start of some bigger idea. Now I want you to imagine the ideal reviewer for that work. For me, it might go something like this:
My reviewer finishes steeping a big pot of tea made from the herbs and flowers picked from their garden in the back of the seaside cottage where they live with their loving spouse. They put on their reading glasses and carefully thump the printed-out copy of my paper against the well-worn but clean wooden kitchen table where they do most of their work (they have an office with a desk somewhere in the back, but that's just a room for books and writing greeting cards). They have had a nice breakfast (but not one so large they are sleepy), and, nice blue felt-tip pen in hand, they begin reading. They, of course, provide lots of feedback (just having reviewers reply "looks good, accept" would not fulfill my psychological need here; I'd want them to mean it, and still give me just enough rejections so I don't get complacent or lazy while still keeping morale up), but it is like that old fake quote about how Michelangelo sculpted David by just chipping away everything that doesn't look like David — they see what the paper could be, underneath all of the grammar errors or awkward language, and they feel their job is to help me bring that idealized paper to fruition, to cultivate it like you might a rose bush — a little pruning here, a little fertilizer there. More sun and more water. They will of course disagree with me from time to time, but that's just an opportunity to bring in more perspectives or find room for common ground and consensus, or even find a whole new way of looking at the research that I was too close to the work to see. A cat nuzzles against their legs at some point. When they are finished, they walk along the beach and listen to the waves.
You get the idea. I would like you to contrast this story (or your personal equivalent) with the kind of story you might tell about what Graham Cormode calls (in a great essay that was one of the several motivations for writing this post) an “adversarial reviewer”: one who is overcommitted and behind on their reviewing load, cranky and overworked and juggling too many things, poked and prodded by reminder emails or other distractions, holding residual grudges from their own rejections, bearing aloft the red pen of judgment and looking for an excuse to reject the paper in order to get it (even if only temporarily) off of their plate so they can go back to doing other things that they find more rewarding. This adversarial reviewer, perhaps through no intrinsic fault in their character, but through circumstances in their life, is not careful or patient or kind. Their cat might try to nuzzle their legs during the reviewing process, but one imagines that the adversarial reviewer would gently push it away.
We in computer science are often adversarial reviewers, in ways big and small. The systems in which we are immersed (getting periodic dumps of large numbers of conference papers to review, with the tacit expectation that the conference will need to reject the vast majority, in the midst of lots of other things we could or should be doing) promote adversarial reviewing. Our incentive structures (with no public incentives for reviewing well and, more meanly, the fact that more rejected papers means less competition) are likewise adversarial. As an example of this built-in adversarial nature, just think about what a "win" looks like for a paper writer: it could be a paper that gets in with flying colors on the first try, or a paper that, after multiple rounds of rejection from doubters, is finally accepted and becomes notable or award-winning. But what does a "win" look like for a reviewer? In the story above, it was something like growing or cultivating a good paper into a great one, but I don't think most reviewers see it that way. When I hear people talk about notable experiences as a reviewer, it's often in adversarial terms (not necessarily adversarial against the paper author, but adversarial nonetheless). Things like keeping a superficially slick but fundamentally flawed paper out of the conference. Or, more positively but still adversarially, defeating a group of negative reviewers in the discussion phase to get a paper you liked just barely through the reviewing gauntlet. Reviewing wins often just aren't very nice. Neither are reviewing "losses": there's the ego damage of writing the lone naïve positive review for a paper whose obvious flaws the other reviewers spotted right away, or of being the outlying curmudgeon among a cohort of people who all seem to be more constructive and coherent than you. A lot of reviewing outcomes just don't feel very good for anybody involved, not just the authors who come away with yet another anecdote about yet another absurd and unreasonable Reviewer Two.
Right, you might say, but science isn't always about being nice. We are aspiring to rigor or (heaven help us) maybe even Truth, and as iron sharpens iron, so one person sharpens another. But, as I have remarked elsewhere, papers in computer science rarely actually die. Reviewers in computer science who think they are being harsh for the field's own good aren't (necessarily) thinning the herd to ensure the strong survive or performing any sort of pruning or preening of the discipline's tree: they are usually just delaying a paper by a few rounds, hitting some metaphorical snooze button somewhere until the paper, perhaps with some additional analyses or citations or battle scars or authorial traumas, or in a lower-tier venue, but perhaps not fundamentally changed, ends up in the literature. It's a lot of work and time and enmity for what ends up being, at best, minor-to-moderate polishing (and not always even that). We are spending so much time and effort on this whole peer review thing, and it's hard to say that we're getting much in the way of concrete benefit out of it.
I mentioned this in yet another, even older, post on reviewing, but one of my related pet peeves is people who treat reviewing as though it's the same as grading homework. "Homework-style" reviewing encourages people to go through a paper with a literal or metaphorical red pen looking for errors, and then, when they (inevitably) find some, to dock literal or metaphorical points. Since no paper is perfect, no paper gets a "perfect" score (when, in fact, a perfect score just means, in most places, "I would argue for accepting this paper" — a much lower bar than some notion of ontological paper perfection). And, since if you dock enough points then the student should fail, the result is that a bunch of minor, extremely fixable issues (like not providing enough details or references or examples) get rolled up into vague top-level "issues" in the meta-review like "reviewers raised concerns about the lack of detail in the experimental methods" or whatever, that end up being totally valueless as a causal explanation for authors as to why a paper was rejected (there are always more potential issues that can be raised, or more details that can be asked for!). At best, what I get from those kinds of decisions is that reviewers weren't excited enough about the idea and were looking for an excuse to justify what might have fundamentally been a vibes-based decision. Or, less adversarially (see, I'm doing it even here): a paper with an idea in it that you're really excited about can "shine through" even a whole host of nitpicks or concerns, but, if you don't see that great idea, the minor concerns might be all you can see in the paper.
This homework-grading impulse is especially frustrating to me both as a reviewer of quantitative empirical work and as a reader of reviews on my own quantitative empirical work. The stringency, depth, and even competency of these reviews (and, again, to be fair, my own) seem to vary wildly, which means that there's (at least to me) not a reliable connection between the statistical rigor or quality of a paper and what kind of reviews it gets, and the potentially fatal meta-review line item about "issues" in the experimental work seems to cover everything from genuine and severe statistical or experimental issues, to low-level concerns that can be fixed by just running a line or two of R code, to people being disappointed that the experiment that was conducted doesn't match up with the idealized experiment in their head that they think they would have run. As a result, I've started to wonder how much the limitations sections in papers in my field are meant to convey genuinely useful information and cautions about interpreting experimental results, and how much of the text in them functions as a dumping ground for feeding reviewer egos: "yes, you are very smart to notice all of the things we could have done instead."
Beyond treating reviewing like grading homework, there's also the issue where it's treated like receiving homework. This is total anec-data, but I've gotten at least head nods from everybody I've talked to about it, so it doesn't seem totally wild: there seems to have been a clear inflection point after the start of the COVID pandemic where the task of recruiting reviewers who would produce high-quality reviews on time went from a kind of mild inconvenience to a periodically Sisyphean ordeal. The resulting quality of reviews goes down, or at the very least becomes much more variable, because the whole process happens in a rush and with overcommitted and/or under-experienced reviewers. More anec-data here, but, as I mentioned in my IEEE VIS 2024 trip report, my suspicion is that anybody in my field with even a cursory public history of research in both data visualization and AI is getting their inbox inundated with review requests, in ways that mean these rara avises are either declining a lot of work that will then go to reviewers who lack the right flavor of interdisciplinary background to helpfully review papers, or are just absolutely losing their minds with overwork and overcommitment.
The last thread of critique I have of Actually Existing Peer Review is that it encourages a specific flavor of “anticipatory compliance.” I’ve now heard multiple very senior people in my field make statements of the form “oh, we did this cool thing but there was no novelty to it, so we knew it would never pass the reviewers, oh well, too bad” or “oh, we know we don’t really need a study for this, but reviewers expect it, so.” These complaints might in fact be true, but I’ve always found them to be deeply unsatisfying. For one, these complaints are always attributed to some absent Other (“of course you and I know that you don’t need to tack on half-assed quantitative studies, it’s those dastardly Other reviewers who always insist on studies when they aren’t required!”), and I’m always skeptical of whether these Others actually exist as stated, or if they are just projections of our own psyches used as excuses or preemptive rationales for timidity. For another, it’s usually senior people making these complaints, and they are the people allegedly performing a good chunk of these reviews, or at the very least training a good chunk of the people who perform them. And, lastly, the existence of large chunks of interesting or impactful work that doesn’t fit the existing patterns is taken as either a historical oddity or some wild exception that’s only possible if you’re particularly skilled or particularly famous. The results, however, are the same: people discouraged from doing things that could be useful or interesting because they don’t fit an assumed mold, doing ritualistic science-y cargo cult stuff they don’t really believe in or understand, or tying themselves in knots to get a “paper-shaped” contribution out of work that could have been much more interesting or much more useful as something other than a 10ish page pdf behind a paywall somewhere.
To recap:
- Reviewers are not well-incentivized to be kind or generous, but have much more incentive to be perfunctory or dismissive.
- Even when reviewers are kind and generous, it’s all a lot of work for very little return on all of that investment in terms of either reliable signal of paper quality, or useful feedback to improve a paper.
- Even when reviewers are kind and generous and are doing useful work, we’re so traumatized by what potentially non-existent mean or hidebound reviewers do that we preemptively do worse or less-interesting work to please them.
Grace-full Reviewing
The result of all of this is that peer review, even if you accept that it does what it's "supposed to" do (weed out incorrect or low-quality work) — which, by the way, I think current peer review is pretty shitty at doing, and certainly shitty at doing in any kind of reliable way — still produces poor outcomes and needless conflict, and makes everybody involved just a little bit more stressed, more mean, or more upset with one another.
With the preliminaries out of the way, let us return to grace. To me, a graceful reviewer is open-minded to new work and new perspectives. Rather than being tasked primarily with separating the sheep from the goats, the graceful reviewer is there to cultivate a paper, to help it flourish into an impactful piece of scholarship, likely over several rounds of iteration and refinement. When someone who has had a graceful reviewer puts "we thank the anonymous reviewers for their valuable comments" in the acknowledgments of the paper, it should not be even a little bit sarcastic. I think being a graceful reviewer is both a skill and a mindset, and probably a thing you can only really commit to doing, in a real and full and useful way, a few times a year, and probably with only one or two papers at a time, given all of the other jobs that academics have to do. Being a reviewer in this sense is likely a similar burden, in terms of work and intellectual contribution, to being a co-author. I'm thinking like Homer calling on the Muses here, or how Hildegard of Bingen would write under the guidance of the umbra viventis lucis, or how Madame Blavatsky would write under the instructions of the Ascended Masters. Like supernatural levels of writing guidance. Certainly not an anonymous person writing a paragraph or two.
Even in the realm of mere mortals, we currently have few incentives to produce models of reviewing that are this longitudinally close and emotionally kind, except in very limited circumstances: some of the relationships that authors of books have with their editors, or that thesis writers have with their thesis advisors, can approach this level of collaborative reviewing effort. If we want to build more of these kinds of relationships, the deck is stacked against us, since doing so calls for a dramatic reduction and/or slowdown in the pace of academic publishing. And the academic publishers want us to publish more, because more papers is more money. The universities want us to publish more, because more papers means higher metrics means more money. And academics (especially students and early career researchers) want to publish more, because the first two pressures mean that the length of the C.V. needed to be "taken seriously" or even just get your foot in the door is continuously and unsustainably expanding. The reduction of the scientific paper from a form of earnest communication of new ideas to a fungible token (you did X amount of work, so you've earned Y papers, and can in turn exchange those for Y professional rewards) does not make for an environment that encourages grace (or good science, for that matter).
So what to do here? Especially when we're surrounded by extremely conservative (or at the very least hard to steer) institutions like academia, stuck in interlocking sets of prisoner's dilemmas where it's hard to make drastic changes without potentially screwing over your students' (or your own) increasingly dire-looking career prospects?
I think we should trust-bust. Peer review has become a monopoly (in this case, on what counts as "real" scientific expression) and, like other monopolies, has become complacent: extracting increasingly high tolls for increasingly diminishing rewards, and encouraging bloat. At the macro level, this could mean re-weighting how much work goes to established academic publishers versus more informal venues. But I think there's also interesting trust-busting to do at the micro level, which is to break apart all of the myriad jobs that are rolled into the umbrella of "peer review" and make each of the extraneous, less grace-inducing parts of the job somebody else's problem. Some of this breaking apart already happens to a greater or lesser extent, either generally or in specific academic sub-areas. For instance, you could have separate procedures for:
- Initial Gatekeeping: Knowing that something is plagiarized, incomplete, or totally out of scope (and/or AI slop) is something you can check for without having to recruit a bunch of experts in a particular subject (or at least, without having to recruit experts in a particular research subject). Right now, how different fields and journals handle desk rejects is kind of weird and idiosyncratic. Some vibes-based desk rejection is fine, but I think we need (perhaps somewhat contradictorily) both more automated processes (for instance, to detect "tortured phrases" and other signs of plagiarism that often go undetected until publication, if at all) and more manual sorting work (ideally done by people who aren't journal editors or reviewers). But in either event, I think having a front line of non-reviewer-based filtering making initial "is this potentially worth your time?" judgments will encourage more grace in the following steps.
- Giving out “tickets” and “tokens”: For some reason we’ve decided that only certain kinds of academic contributions “count”, which has turned into a perverse incentive structure where academic work is only performed if there’s a chance of a countable payout at the end (for CS, it’s conference papers in select sets of conferences; for other fields it might be journal articles in select sets of journals, or even book chapters). Connected to this is the increasingly expensive and unsustainable necessity(?) of going to in-person conferences, where you often need to have a paper to justify the time and expense of going. As long as one of the primary goals of reviewing is determining whether or not a work “deserves” these artificially scarce tokens of worth, it can’t help but be adversarial. Just give out the tokens profligately, and let the sorting happen in other ways.
- Filling out conference agendas and journal issues: If we take the step above, then we'll have a lot of accepted work. Then beancounters and other people will complain about not having ways of separating the sheep from the goats. I still don't know if this sorting is something that an individual reviewer needs to be doing, though. Consider, for instance, the model (already quite common in the humanities) where you submit an abstract that's more like a promise of a paper to come, and then the actual paper gets written once you've gotten a thumbs-up "seems like a neat idea" from a lightweight reviewing or jurying process. Or the model (common in the sciences) where you submit something and it's generally accepted as, say, a poster, but a subset of accepted submissions give a formal talk in addition to presenting the poster. Reviewing helps guide these decisions, perhaps, but it shouldn't be the primary job of reviewers.
- Assessing experimental design: I am going to go out on a limb and state, without empirical evidence, that the modal reason why reviewers claim they are rejecting empirical papers in my field is not that reviewers think the ideas in the paper are poor, or the topic uninteresting, but that they have issues with the experimental design. As mentioned above, I'm actually skeptical about whether these issues are the actual reason for voting down a paper or are just cover for deeper vibes-based rationales, and I'm also skeptical that my community, on the whole, actually does a very reliable job either designing or evaluating experiments. All that being said, it's wild to me that we have an entire scientific community that will smugly quote Fisher's "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of" at each other but then decide that the best time to evaluate experimental design is after it has already been conducted and written up. I think comments on registered reports should be the main form of peer review on experimental design per se, with almost all of the back-and-forth arguments about experiments happening prior to data collection. It won't stop all arguments, and of course no plan survives contact with the enemy, but it would at least make the main work of reviewing more substantive. It also might make people better (or at least more thorough) at designing experiments.
- Auditing and correcting: One of the counter-intuitive things that new authors and reviewers sometimes discover is that reviewers usually don't do very much actual checking or editing. It's largely a waste of time to provide line-level edits or corrections for a paper that only has, say, a 20% chance of actually getting accepted, and if it's rejected it will (hopefully) be for content-related reasons that will require rewrites that will likely end up obliterating those problematic sentences anyway. Likewise, while it's very common for reviewers to complain about statistics and experimental methods, it's vanishingly rare for reviewers to dig into the actual code or analyses and do error-checking or debugging or anything like that (even when such things are actually provided, which is also less common than you'd think). As a result, discovering fundamental errors (like, "whoops, we missed a zero when multiplying") often happens after publication, as the result of a roving band of auditors, or of a future researcher hoping to extend or converse with prior work. This can't help but make this kind of auditing adversarial (if somebody finds an error you might have to retract! Or issue a correction! A huge amount of work with potentially dire career implications! Wouldn't you be defensive under such circumstances?). Ideally we'd have much better and less drastic ways of doing this (in the same way that bugfixes and edits are No Big Deal in other kinds of writing, software design, or communication). But, at the very least, if there are groups dedicated to doing this prior to publication, and this kind of editing and correction is just part of the usual pre-publication process, maybe it will take some of the sting out, in the same way that I don't treat the red wavy line of a spellchecker as a personal attack.
Okay, so if we've refactored out all of the work above, then the job of a peer reviewer, as in "a peer who looks at the submitted paper and provides detailed feedback," is both much narrower and much more constructive. It's somebody else's problem to decide whether the paper gets metadata noting that it was accepted to an appropriate venue, or whether the paper has a missing comma on the third page, or whether the sample size is justified by a power analysis. Rather, to paraphrase Meister Eckhart, the reviewer is emptied of things so they can be full of grace. The reviewer's job is then to make the paper in front of them as good as it can be, to work with the authors rather than against them. It's a thing you do when you really believe in a paper, when you can give a meaningful portion of your time, attention, and energy to it.
In short, while we await a revolution of the publication model, I’ll settle for an evolution of the publication model that looks something like this:
- You (as a singular person or plural “you” research group) have an idea.
- You write up an initial proposal about the idea that you workshop with your peers. Lots of things that you might, in the past, have tried to turn into papers could get generated during this phase (and publicized or published, although not in as formal a venue: remember when workshops were for actually workshopping ideas?), and potentially during the other phases as well.
- You prepare an abstract. If there’s an empirical component, you also prepare a registered report. External peers in the venue where you want to submit look it over and work with you to iterate on the abstract/report until it’s in good shape. This step will likely involve a bit of piloting and data collection so you are not totally off base when you actually start the work in earnest.
- You do the research you say you’re going to do. You might end up having to do more things in reaction to what you find, but that’s okay, that’s what exploration is about. But you know that there’s at least a core experimental design that people have signed off on as being not totally nuts.
- You write up your results into a paper-shaped thing. It may look very little like a finished paper, but it’s at least a solid first draft that contains what you want to communicate.
- You send your paper-shaped thing to the venue where you want to publish. An initial auditor looks over it for obvious signs of malfeasance or topical mismatch.
- Once you pass the audit, you go ahead and send the paper to capital-R Reviewers. These reviewers look through the draft and propose things to fix, ways to improve the argumentation, analyses to be refined. At no point is there a scorecard where reviewers say the paper gets a “2/5 for novelty” or whatever bullshit. Instead, this process is purely about improving the logos of the paper and the quality of the presentation. This process may take several cycles of back-and-forth communication.
- When you are satisfied, it goes to the higher-ups in the venue. They are highly encouraged to accept the paper, since the topic and plan of research were already vetted way back in step 3, and external people they trust gave it a look over in step 7. So this is just about "did they land somewhere reasonable in their execution?" There may be different levels of acceptance (maybe not all accepted papers go to the conference that year, or they all go as posters but not all get invited talks, or…).
- You (and the reviewers) celebrate a job well done.
I don’t think this is a terribly wacky process, and I note that lots of venues already do many or most of these steps. As a concrete example, I think the “paper reviews as github issues” experiment at JoVI is close to what I mean for at least some of this stuff, and of course there is some of this model that shows up in things like grant writing, book proposals, and in more informal models of peer review like circulating drafts among friends or colleagues.
Immediate objections to the above:
- Won't this dramatically slow down the rate at which peer-reviewed papers are published? Yes, it will. It should. Well, for certain kinds of papers, anyway. I hate the notion of papers as these little fungible units of academic currency, so anything that breaks the model here is probably for the better in terms of the actual quality of the material. If you have other ways of getting those kinds of rewards (if pro-forma abstracts get accepted most of the time, for instance), then you don't have to submit a paper unless you want to: because you have something worth sharing, rather than because you need it to make sure you get a payout for the work you did. A paper is then more like a book or a thesis or something, rather than a thing you produce large numbers of. If you have lots of ideas, you can of course still put out workshop papers or blog posts or pointed letters to the editor or other things like that (in fact, you might put out more of these kinds of things, since you aren't waiting for things to be "paper-shaped" or having to deal with multiple rounds of rejection).
- Who is doing all this other work that you just refactored away from reviewers? Well, some of it is being done already, by people like program committees, or in journals' post-acceptance publication processes. But other parts (for instance, grammar-checking or double-checking analyses) we pretend are being done when they often aren't, or are done only stochastically. I think if for-profit journals want to pretend that they are anything other than parasitic, they should use some of their money to hire and pay people to do some of this work (some journals, to be fair, do). But I also think that we are currently not allocating the finite work and care and grace of people in academia in logical ways. Is it really the best use of the time of some of (allegedly) the smartest people on the planet, with unique and sometimes irreplaceable expertise, to have them sit around debating whether a mediocre paper is more of a 2.5/5 or a 3.5/5? Shouldn't they give their expert feedback where it would do the most good, in the places where they are most passionate or most informed?
- What about the bad actors? For some reason, there are large groups of people who hate the thought that somebody might take advantage of generosity, and who view the kind of mindset I'm asking for as a sign of naïveté or weakness. It's why we, for instance, spend something like five times more on fare enforcement police than those police ever recover in fines for skipped fares. This fear is usually horseshit. Furthermore, most (admittedly, not all) of the existing bad actors out there in the academic writing space are there because they are trying to game the system, to boost h-indices or pad CVs, or because they have snapped under the pressure of a publish-or-perish system: a system that, regardless of its initial intentions, is now mostly there so hiring committees and governments can be lazy and/or neoliberal in determining who in academia to reward or punish, or so publishing companies can make a lot of money. It kind of seems, then, like maybe the problem is the system, rather than the bad actors. I think making the detection of bad actors somebody else's job (after all, peer reviewers are meant to be experts in their field, but aren't necessarily experts in fraud detection or prevention, especially if they are going into this with open hearts etc. etc.) might actually be more reliable than leaving it in the hands of reviewers.
Wrap-up
This is almost six thousand words already; why on earth would you want more words here? I'll try nonetheless:
- Reviewing feels bad to me: it doesn't feel like useful work that does good things for the world around me. It doesn't seem to improve the field very much globally, or make papers much better locally.
- To the extent that you believe my perception should be taken seriously, I think it's due to lots of inherently adversarial bits of work being wrapped up into a single holistic thing called "reviewing." Sometimes the bits that are adversarial or zero-sum are that way for not very good reasons, or are artifacts of systems that we could easily move past.
- One potential solution is to just make the job of reviewing much more narrow. I want fewer peer reviews by slower reviewers who are rewarded more for their time, and I’m not kidding.