
Scientific Peer Review - Umbilical Cord or Corduroy Umbrella?

2025 Contest · February 6, 2026 · 15 min read · 3,357 words

We are set to be flooded with AI-generated promotional text that masquerades as science. In principle, scientific peer review should protect us. But current peer review is like a corduroy umbrella – cumbersome and ineffective. Is it hopelessly flawed? Here I suggest what is required to turn it into something akin to an umbilical cord - a device that filters out contaminants and passes nourishment through to the next generation.

When I was about 8, I asked my Dad for a ranking of the worst words you could say when he was my age. The list he gave matched my own fairly closely. “Frig” was the only surprise entry – an Ngrams search confirms that it was at a low ebb in 1990.

I recently came across a list of curse words written in hard-pressed pencil by the hand of my 7-year-old daughter. I was delighted. It is progress that the most objectionable words right now are those that denigrate types of human rather than those that describe bodily acts. It gives the heartening impression that the truth will out and that good ideas will rise to the top.

Once crummy ideas grab hold, they are sticky

But a word that upsets me is “insofar”. You might think that my problem with it is that it takes three words and squishes them together to form a new word that has precisely the same meaning and pronunciation as “in so far”. But no, quite the opposite. My problem with it is that it fails to squish in the inevitable fourth word. It troubles me that the word “insofar” exists at the expense of the strictly superior word “insofaras”.

I would like to say that each time I read it I feel a prick of despair that we cannot trust that the best ideas will rise to the top. But I cannot even say that. The truth is, it fails to register with me. I read the phrase “insofar as” published on a page of The Economist and I fail to notice that some of the most valuable real estate in the world is wasted by that gap between “far” and “as”.

And therein lies the problem. Crummy ideas can persist (and even get reinforced) simply because they took hold so early that we now take them for granted. Even the countervailing motives that drive the editors of The Economist cannot be relied on to displace them.

The Coming Flood

Of course, if crummy ideas and profit motives align then things get much worse.

Let’s imagine that I am hired by the bubblegum industry to offset the decline in gum-chewing among kids. One route that I could pursue is to influence regulators’, courts’ and laypeople’s perceptions of my product, e.g. by causing them to believe that gum-chewing is a buffer against anxiety.

Two years ago, it would have been pretty costly for me to do this. Even if I faked the study by typing the results into an Excel spreadsheet, it would take days to deliver a single article. And the end result would sit at the bottom of the Google Scholar search results, just another piece of grey literature with no citations.

But now, an LLM will write that paper in minutes and so I would get it to write dozens or even hundreds of papers showing related results: the benefits of bubblegum blowing for GPA; its protective effects against dementia etc. etc. Then, I would have each cite the others so that any one of them has hundreds of citations. Now these papers appear highly impactful and they will be among the first articles a reader sees when they search the literature on chewing gum. And, more menacingly, they will also feature in searches for anxiety and dementia.

The Scourge

What I describe here is not some distant and speculative future; in fact, it more closely describes the past. In the 1940s, a medical doctor named Arthur Sackler joined a medical advertising agency where he pioneered a form of marketing that presented itself as science. One ruse was to have credible medics put their names to papers that were secretly written on behalf of pharmaceutical companies. That practice was implicated in the cancers of 14,000 women in the early 2000s, when it was revealed that the pharmaceutical firm Wyeth had commissioned a communications firm to ghostwrite scientific articles endorsing their hormone replacement therapy.

Sackler’s name is now best known for his family firm’s callous marketing of OxyContin, but it is not obvious that the hundreds of thousands of deaths hastened by that drug are the most harmful piece of his legacy. The New Yorker quoted psychiatrist Allen Frances as stating “most of the questionable practices that propelled the pharmaceutical industry into the scourge it is today can be attributed to Arthur Sackler.”

It’s worth unpacking that scourge. If what comes to mind is an annoying commercial telling you to ask your physician about dialarex, you’re not wrong. The downstream consequence of those commercials is that people do ask for dialarex, physicians receive bonuses for prescribing it, and so sales of dialarex boom. Correspondingly, a generic drug that could have delivered the same clinical benefit at a tenth of the price is left languishing. It is not a coincidence that medical spending went from $5 out of every hundred spent in the US in 1960 to $18.30 out of every hundred today. That is not merely wasted money; it is also a source of overmedication, overdiagnosis and hence ill health.

The waste and harm that led Allen Frances to label the pharmaceutical industry a scourge will soon characterise many other aspects of our lives. The pharmaceutical industry just happens to have certain features that made it an early mover. On the demand side, its consumers – physicians – are especially receptive to advertisements presented as science. On the supply side, the fact that it is very costly to develop a drug but very cheap to make additional pills made it worth spending a lot on boosting demand. Of all the firms that would be willing to incur the costs of advertising via research, a pharmaceutical manufacturer is exactly who we would expect to see investing.

Now that the costs of producing spurious research are trivially low, however, it will be worthwhile for virtually any firm. And as the Covid pandemic made clear, there is no shortage of individuals who are willing and (consider themselves) able to parse a scientific literature to come to their own conclusion. The scourge that Allen Frances identifies as afflicting the pharmaceutical industry today is the scourge that we should expect to be all-pervasive imminently.

Losing Ideas

One consequence is an increasing risk that great ideas will go unrecognised. Two years ago, the costs of producing a scientific paper were so high that the steady trickle of new findings was mostly made up of sincere efforts at contributing to scientific knowledge. But AI will soon turn that trickle into a flood, and the overwhelming majority of what is produced will be designed for profit or PR or for some other purpose where truth is an irrelevance.

How are we to distinguish the useful insights from the dross?

Peer review is purpose-made to discern quality research. In my field (economics) it works as follows: a scientist submits an anonymised version of their paper to a journal for review; an editor who possesses some expertise on the topic of the paper selects two or three researchers with deeper expertise to review it; those reviewers assess the merits of the paper and recommend to the editor whether to publish it as-is, reject it outright, or accept it subject to certain clarifying questions being adequately addressed by the authors. The editor then shares the anonymised reviews with the author.

There are a couple of features that make peer review approximate St. Peter at the gates of Heaven - an incorruptible, unbiased judge of merit, blessed with a deep knowledge that promotes only the truly worthy to the pantheon.

It is anonymous, which reduces scope for corruption. Many journals use double-blind peer review, where reviewers cannot see who has written the paper. This is especially valuable because the incentives align to promote the most insightful ideas, regardless of who thought them up or how unpalatable they might be to certain audiences.

It is the independent opinion of two or more experts and an editor’s judgment also. This, while not quite harnessing the Wisdom of the Crowd, offers at least some buffer against caprice.

In practice, peer review falls far short. Its more egregious failures are well documented. It fails to spot false results, e.g. the faked dishonesty study that ended up costing the Guatemalan government dearly. It fails to recognize true innovations, e.g. the Nobel Prize-winning one that has been credited with shaping everything from the Affordable Care Act to Carfax reports.

But there are many more mundane issues. The Data Colada blog is an excellent place to see these laid bare. Take a recent randomized controlled trial of a brief training program conducted among 2,070 police officers reported in the Quarterly Journal of Economics. It concluded that training police to consider their options reduced use of force and discretionary arrests. That’s a very useful and scalable finding. But the Data Colada blog makes a very compelling case that it is not reliable.

The red flag here is that the authors had preregistered their hypotheses and so Data Colada (and you, me and, notably, the reviewers who accepted this paper for publication) could look to see if the reported results are the same as the results that had been hypothesised prior to data collection. They were not.

Now you might say all’s well that ends well and science has worked because Data Colada has updated the record. But there are three problems with that take. First, this is wasteful. A really resource-intensive and potentially scalable experiment was conducted on a topic of great interest and, because of a lack of transparency, we simply don’t know if it worked. If peer review had worked well, we would be able to see all the results the authors had preregistered rather than just the ones they opted to show us. Second, it is beyond Data Colada and similar blogs to clean up all the messes left by lackadaisical reviewers. There will be many specious results that they lack the time and attention to expose and those will mislead real-world decision making. Third, the damage has been done.  Jorg Peters and co-authors show that comments published in the flagship American Economic Review are cited far less than the original research that they correct. Even when the authors of the original paper concede that the comment substantively tempers their conclusions, the comment has orders of magnitude fewer citations than the original paper. This points to the value of peer review as a screening device that works ex ante. It is far more effective to avoid contaminating the literature in the first instance than it is to correct it later.

How discriminating is our filter?

We need something that acts as a filter that extracts contaminants and passes to future generations the stuff that is conducive to growth, health and wellbeing - something like an umbilical cord. What we have is more like a corduroy umbrella – not only patchy in its effectiveness but cumbersome too.

Consider 100 articles submitted to the top-tier social science outlet, the Journal of Political Economy. The editors judge 49 of them to be worth sending out to reviewers. The latest data show that only 6 of them are ultimately selected for publication. The median time it took to let authors of the remaining 43 papers know that reviewers had rejected their submission was 3 months. That is 3 months where the paper cannot be submitted to another outlet. Papers are languishing, even though the editors of a top journal in political economy have already selected them as likely to offer important contributions to knowledge on social phenomena. And this is happening at a moment when the insights of political economy might be especially helpful in restoring faith in institutions.

This delay also has the effect of undermining the objectivity of peer review. Because of time lags, academics typically post their work online as a preprint so they can accumulate citations and influence and avoid being scooped. As a result, by the time reviewers come to read the paper they can already see how many citations it has, as well as the authors’ names and affiliations. That matters because it detracts from the independence of peer review. At a conscious level, it takes bravery for a reviewer to reject a paper that has already accumulated hundreds of citations. At an unconscious level, we know that evaluations are influenced by confirmation bias and so a paper is more likely to be given the benefit of the doubt if it has already received hundreds of citations. And this weakness is exploitable - we can infer from their failure to check preregistration that reviewers are unlikely to check whether the hundreds of citations predominantly come from AI-generated dross.

Most insidiously, the time lag leads to more conservative research agendas. When choosing among research projects, scientists have to weigh up expected benefits (e.g. citations; prestigious publication) against expected costs (e.g. the opportunity cost of their time). The lower the time costs, the more risk a scientist can afford to take. When the time costs are high, scientists are all the more incentivized to take on projects that are guaranteed citations (i.e. that contribute to established and growing literatures) and that are least objectionable to reviewers (i.e. that align with rather than debunk established theories). The net effect is fewer groundbreaking ideas.
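The trade-off can be made concrete with a toy model. Every number below is invented purely for illustration: a “risky” project pays off more if published but is rejected more often, so each extra round of review - and each extra month of lag per round - hits it harder than a “safe” incremental project.

```python
# Toy sketch of the project-choice argument above. All numbers are made up.
# A project takes `work_years` to produce, then cycles through submission
# rounds; each round costs `review_lag_years` and is accepted with
# probability `accept_prob`, so the expected number of rounds is 1/accept_prob.

def payoff_rate(payoff, accept_prob, work_years, review_lag_years):
    """Expected payoff per year of total time invested."""
    expected_rounds = 1 / accept_prob
    return payoff / (work_years + review_lag_years * expected_rounds)

def preferred_project(review_lag_years):
    # Safe project: modest payoff, usually accepted on a given round.
    safe = payoff_rate(payoff=50, accept_prob=0.8,
                       work_years=1.0, review_lag_years=review_lag_years)
    # Risky project: bigger payoff if it lands, but rejected far more often.
    risky = payoff_rate(payoff=120, accept_prob=0.2,
                        work_years=1.0, review_lag_years=review_lag_years)
    return "risky" if risky > safe else "safe"
```

With these (invented) parameters, a short review lag makes the risky project the better bet per year invested, while a long lag tips the choice toward the safe one - the mechanism behind “fewer groundbreaking ideas”.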

Is Peer Review Hopelessly Broken?

There are efforts to reform peer review. Thanks to the Data Colada group and others, many journals require authors to post their data and code. Some journals even employ people to check that the results reported in the paper precisely match those obtained from running the analyses. But, as the Data Colada blog demonstrates, unreliable results still turn up even in the places that we would expect to be most buffered against them i.e. top journals in economics and psychology. The editors and reviewers at these outlets understand incentives and human behavior and are skilled interpreters of data but that is not sufficient to protect against specious results.

The scarce factor is motivation. Reviewing another scientist’s work is never a priority. It only ever gets done out of civic-mindedness or, perhaps in some small number of cases, out of a selfish desire to promote one’s own research. The false results that enter the literature are one symptom of this lack of motivation. They get attention because they are salient and make for a good story. But this lack of motivation creates bigger problems through the drag that lags to publication place on progress e.g. the research that does not get done.

Attempts are being made at making peer review more efficient - a website collates the various efforts. One strand of research has experimented with paying reviewers or waiving journal submission fees. Another approach creates a publicly available record of how many reviews a researcher has completed. Platforms like Publons and ORCID allow researchers to signal to promotion panels their productivity as a reviewer alongside their productivity as a researcher. A third strand publishes reviewers’ comments alongside the paper, giving them credit and potential citations (though at the expense of double-blinding).

While there is some evidence that paying reviewers hundreds of dollars can speed up peer review, there is nothing to suggest that we have yet found a scalable mechanism that will make peer review function as the discerning filter we require to deal with the coming flood of AI-generated content.

For that, we will need to make peer review a priority among scientists instead of the afterthought that it is currently. We will need an incentive structure that rewards scientists for the quality and timeliness of their reviews. We will need a market for timely and quality peer review.

A Job Market for Quality Peer Reviewers

Just as Google Scholar publishes researchers’ h-index, it could publish a new metric that measures the contribution to science they make through their reviews. I call this an R-Squared score, and it accumulates through points that editors give for timeliness (0 for late, 1 for acceptable, 2 for exemplary) and points that the other reviewer of the paper gives for quality (0 for unhelpful, 1 for fine, 2 for exceptionally helpful).
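To fix ideas, here is a minimal sketch of how such a score might be tallied. The class and field names, and the simple additive tally, are my own assumptions; only the two 0–2 scales come from the proposal above.

```python
# Minimal sketch of an R-Squared tally. The 0-2 scales follow the proposal;
# the additive accumulation and all names here are illustrative assumptions.

from dataclasses import dataclass

TIMELINESS = {"late": 0, "acceptable": 1, "exemplary": 2}   # scored by the editor
QUALITY = {"unhelpful": 0, "fine": 1, "exceptionally helpful": 2}  # scored by the other reviewer

@dataclass
class ReviewerRecord:
    name: str
    points: int = 0
    reviews: int = 0

    def log_review(self, timeliness: str, quality: str) -> None:
        """Record one completed review: editor's timeliness score plus
        the co-reviewer's quality score."""
        self.points += TIMELINESS[timeliness] + QUALITY[quality]
        self.reviews += 1

    @property
    def r_squared(self) -> int:
        """The cumulative score an index like Google Scholar could display."""
        return self.points
```

A reviewer who turns in an exemplary-speed, fine-quality review would add 3 points; a late, unhelpful one adds nothing - so the score rewards exactly the two dimensions editors care about.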

But - you will no doubt say - the last thing we need is more burden on reviewers and editors. Here's the thing - almost all of the colleagues I have spoken to about this idea fall into one of two camps: those who already read the other reviewer’s report out of curiosity or those who don't want to be reviewing in the first place. So in a world of R-Squared scores, there is no extra work for the first camp and - because their choice not to display a score will signal to editors that they don't want to receive an invitation to review - there is less work for the second.

And it will save editors work because it allows them to identify quality and timely reviewers. Currently editors lose time sending papers for review to people who fail to reply, e.g. because they have left academia. Or, to avoid that time-wasting, they send ill-fitting papers to people who they know to be timely reviewers. Either way, there is a mismatch. By providing publicly available information on researchers’ reviewing activity we can help editors quickly and easily identify content experts who are also active and engaged reviewers.

Providing editors with information on reviewers’ quality and timeliness will reduce academic fraud too. In recognition of the fact that editors waste a lot of time finding reviewers, many journals invite submitting authors to nominate reviewers. This practice is sometimes abused by citation rings - groups of authors who big up each other’s work. R-Squared obviates the need to rely on reviewers nominated by the author.

R-Squared provides feedback on review quality that is currently missing. As it stands, there are reviews that are wrong-headed, either out of sincere misunderstanding or out of motivated self-interest. The 0–2 reviewer-on-reviewer score can flag quality concerns so that editors know to give these reviews special scrutiny. More than that, the mere awareness that one’s review will be scored will deter egregiously self-serving reviews in the first instance.

Changing the scientific ecosystem one review at a time

It is a grand ambition to seek to transform the scientific ecosystem. But the incentives align to make R-Squared a scalable technology. If just one journal adopts the practice of scoring reviewers for timeliness and quality then, merely through the mechanism of making the importance of timeliness salient to reviewers, we would expect it to achieve faster turnaround times. Concurrently, that journal will be accumulating data on the quality and speediness of its reviewers that it can use later to match quality reviewers to papers in their areas of expertise. As authors experience more rapid turnaround times, they will submit their work to this now speedier outlet instead of to its rivals. Other journals will have to adopt the technology or lose out on submissions, influence and prestige.

A published metric that summarizes a researcher’s reviewing frequency and quality creates a viable career track for expert reviewers. It will allow promotion committees, for the first time, to recognize excellent reviewers’ contributions to science. Just as there are game-changingly talented music critics, film critics and literary critics, there are scientists who are excellent arbiters of others’ research. But their contribution to science is not currently valued and so, unless those scientists also happen to be excellent on the separate set of skills required to produce original research, we are likely losing them from the scientific community. These people will be increasingly valuable to science as the quantity of research explodes. Progress requires that we find a mechanism to reward them.