Ethan Mollick on AI Papers That Expire Too Fast
A response to Ethan Mollick on why AI peer review lags model change, and how to design papers that stay valid over time.
Ethan Mollick, Associate Professor at The Wharton School and author of Co-Intelligence, recently shared something that caught my attention: "Increasing problem with publishing work on AI is that the publication process is much slower than working paper process, so when papers finally get full peer reviews, authors are asked to account for newer papers that are built on the paper under review!" He calls it a kind of "research ouroboros," and if you have ever tried to publish anything in a fast-moving field, you can probably feel the bite.
What Mollick is describing is not just an inconvenience. It is a structural mismatch between the tempo of AI progress and the tempo of academic validation. And because published, peer-reviewed papers are still the currency of tenure and promotion, this mismatch shapes careers, incentives, and the kinds of claims researchers are willing to make.
In this post, I want to expand on Mollick's point and translate it into practical guidance: what is actually happening in AI publishing, why it is getting worse, and what it might look like to build papers designed to survive continuous model updates.
The AI publishing time gap is no longer a rounding error
AI research has always moved quickly, but the last few years have changed the baseline. Model capability can jump meaningfully within a single year, and sometimes within months. Meanwhile, journal and top-conference pipelines routinely take many months to more than a year from initial submission to final publication.
Mollick highlights the core collision: by the time a paper is fully reviewed, it may be required to cite and address newer work that used the authors' own working paper as a foundation.
"When papers finally get full peer reviews, authors are asked to account for newer papers that are built on the paper under review!"
This creates a strange loop:
- Researchers share a working paper to get feedback and establish priority.
- Others build on it quickly, often in public.
- Reviewers later ask the original authors to address these follow-on results.
- The original paper gets revised, resubmitted, and delayed further.
None of this is malicious. It is just what happens when dissemination is measured in days and certification is measured in seasons.
The "model version" problem: your results age out before they print
Mollick also points to a second, related issue: working papers are often written about earlier models. Reviewers who know the space will ask authors to re-run experiments on newer systems.
That request is reasonable. It is also destabilizing.
If you evaluated GPT-style models in early 2024, your conclusions about reasoning, coding, or reliability might not hold in late 2025. Even worse, updating to a new model can change not only performance but behavior: different refusal patterns, different tool-use ability, different sensitivity to prompt format, different strengths by domain.
This is why Mollick warns that authors need to plan for updating from the start, not as an afterthought.
Papers increasingly need to be built for easy updating as new models come out.
In practice, many AI papers are still written like traditional one-time snapshots: here is the model, here is the task, here are the results. That format works when the underlying technology is relatively stable. In foundation-model AI, it can turn into a historical artifact almost immediately.
The trap of "AI can't do X" claims
One line from Mollick's post should probably be printed and taped above a lot of keyboards:
The "AI can't do task X well" papers need to instead be rewritten about the trendline, because by the time you publish, it may very well do task X well.
This is not a call to hype AI. It is a call to change the unit of analysis.
A paper that argues "model Y fails at task X" can be valid and careful at submission time, yet misleading by publication time if task X becomes tractable for a new generation of models or with minor scaffolding (tools, retrieval, structured prompting, fine-tuning, better evaluation protocols).
A more durable approach is to describe:
- What aspects of task X are hard (data constraints, long-horizon planning, ambiguous goals).
- Which interventions matter (tools, decomposition, verification, human-in-the-loop).
- How performance changes across model families or over time (the trendline).
- What failure modes persist even as averages improve (calibration, safety tradeoffs, brittleness under distribution shift).
That way, the contribution survives capability leaps because it is about dynamics and mechanisms, not only about a single score.
What a "paper built for updating" looks like
If we take Mollick seriously, we need to treat AI papers more like living measurement systems. Not living documents that change their claims arbitrarily, but research artifacts designed for repeatable refresh.
1) Separate the contribution from the benchmark run
A durable AI paper usually has at least two layers:
- The conceptual contribution: taxonomy, theory, method, dataset design, causal identification, workflow, or new evaluation framing.
- The instantiation: results on model version A, with prompts, hyperparameters, tool settings, and compute budget.
If the paper blurs these together, a reviewer asking for model version B can force a rewrite of the whole argument. If they are separated, the refresh is an appendix update or a results table extension.
2) Treat prompts, tools, and scaffolds as first-class methods
In foundation-model evaluation, "the model" is rarely the whole system. Prompting strategy, tool availability, and verifier steps can dominate outcomes.
So build the experimental design to make these components explicit:
- Publish prompt templates and variations.
- Log tool calls and constraints.
- Document decoding settings.
- Explain any human intervention policies.
Then, when models update, you can re-run the pipeline with minimal ambiguity.
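As a minimal sketch of what "first-class methods" could look like in practice (the structure and field names here are my own illustration, not from Mollick's post), a paper's artifact might log every scaffold component in a single machine-readable manifest per run:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class RunManifest:
    """One evaluation run, with every scaffold component made explicit."""
    model: str             # exact model identifier, including version or date
    accessed: str          # when and how the model was accessed
    prompt_template: str   # the full template, not a paraphrase
    decoding: dict         # temperature, max tokens, and similar settings
    tools: list = field(default_factory=list)  # tools available, if any
    human_policy: str = "none"                 # human-intervention rules

# Hypothetical example values; a real pipeline would log these automatically.
run = RunManifest(
    model="example-model-2025-06-01",
    accessed="2025-06-15 via public API",
    prompt_template="Solve the task step by step:\n{task}",
    decoding={"temperature": 0.0, "max_tokens": 1024},
    tools=["calculator"],
)

print(json.dumps(asdict(run), indent=2))
```

With a manifest like this committed alongside results, "re-run on the newer model" means changing one field and re-executing, rather than reconstructing the setup from prose.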
3) Pre-commit to update rules
One reason updates are painful is that they feel like moving goalposts. Authors worry (reasonably) that changing models changes the story.
A way out is to specify update rules up front:
- Which new models qualify for inclusion (top-3 by a public leaderboard, or any model exceeding a capability threshold).
- What stays fixed (task set, scoring rubric, error taxonomy).
- What can vary (prompt budget, tool set) and why.
This does not eliminate change. It makes change interpretable.
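One way to make update rules genuinely pre-committed is to write them as a mechanical check. This is a sketch under assumed names and an assumed capability threshold, not a standard from any venue:

```python
# Pre-committed update rules, stated before any refresh is needed.
# The threshold and fixed-artifact names are illustrative assumptions.
UPDATE_RULES = {
    "capability_threshold": 0.80,  # minimum public-benchmark score for inclusion
    "fixed": ["task_set_v1", "scoring_rubric_v1", "error_taxonomy_v1"],
    "may_vary": {"prompt_budget": "up to 3 variants", "tool_set": "documented per run"},
}

def qualifies(model: dict) -> bool:
    """Decide mechanically whether a new model enters the refreshed results."""
    return model["benchmark_score"] >= UPDATE_RULES["capability_threshold"]

candidates = [
    {"name": "model-a", "benchmark_score": 0.85},
    {"name": "model-b", "benchmark_score": 0.70},
]
included = [m["name"] for m in candidates if qualifies(m)]
print(included)  # → ['model-a']
```

The point is not the specific threshold but that inclusion is decided by a rule written down in advance, so a refresh cannot be read as cherry-picking.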
4) Report trendlines and uncertainty, not just point estimates
Instead of a single headline number, trend-aware papers can report:
- performance across multiple model releases,
- confidence intervals or bootstrap ranges,
- sensitivity to prompt variants,
- error distributions.
When the inevitable capability jump arrives, your paper still tells the reader what moved and what did not.
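The reporting above is easy to generate mechanically. Here is a minimal sketch with hypothetical per-release scores and a plain percentile bootstrap for the interval (the release names and numbers are invented for illustration):

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-seed accuracies on a fixed task set, by model release.
by_release = {
    "release-2024-01": [0.42, 0.38, 0.45, 0.40, 0.44],
    "release-2025-01": [0.71, 0.69, 0.74, 0.70, 0.73],
}
for release, scores in by_release.items():
    lo, hi = bootstrap_ci(scores)
    mean = sum(scores) / len(scores)
    print(f"{release}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

A table built this way extends naturally: each new release is one more row, computed by the same code against the same fixed task set.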
The reviewer dilemma is real, but norms can evolve
Mollick notes there are "no real norms" around this loop yet. Reviewers are stuck too. If they ignore new work and new models, they risk letting obsolete claims into the record. If they demand full refreshes, they increase delay and burden.
Some emerging norms that could help, without compromising rigor:
- Clear versioning: require authors to specify model versions, dates, and access methods as prominently as datasets.
- Bounded updating: allow a "results refresh" section that updates core tables without forcing a full rewrite.
- Artifact expectations: treat reproducible pipelines as part of the scholarly contribution.
- Citation fairness: recognize that being built-on quickly is impact, not a penalty, and avoid review demands that effectively punish early sharing.
None of this is trivial, especially given the tenure context Mollick flags. But the alternative is a growing gap between what the literature says and what the technology can do.
Designing for generalizability is now the main job
Mollick ends with a straightforward challenge: "Authors need to think really hard in advance about how to build generalizable papers that will hold up over time." I read this as a call to shift from publishing static claims about a moving target to publishing frameworks that can accommodate motion.
In AI, the question is less "What can model X do today?" and more "What changes performance over time, and what stays stubbornly hard?" Papers that answer the second question will age better, review better, and ultimately help the field make sense of progress rather than chase it.
This blog post expands on a viral LinkedIn post by Ethan Mollick, Associate Professor at The Wharton School and author of Co-Intelligence.