
Ethan Mollick and the Nature Claim: AGI Already?
A response to Ethan Mollick's viral post on a Nature claim that today's LLMs are AGI, and what that means for AI work.
Ethan Mollick recently shared something that caught my attention: a "pretty bold comment in Nature" by linguists, computer scientists, and philosophers claiming that AGI has already arrived in current large language models.
He quoted the line that makes your eyebrows go up:
"By reasonable standards, including Turing’s own, we have artificial systems that are generally intelligent. The long-standing problem of creating AGI has been solved."
When a group of serious researchers says that out loud, it forces a question many of us have been skating around: if you take the original spirit of the Turing Test and a practical view of generality, do today’s LLM-based systems count as "generally intelligent" in a meaningful sense? And if they do, what changes about how we evaluate, govern, and deploy them?
The bold claim Mollick highlighted (and why it matters)
Mollick’s post is short, but it points at a major shift: the argument that AGI is not a single future milestone but something we might have crossed gradually, with models that can do a wide range of tasks at a level that surprises experts.
The Nature quote makes two moves:
- It appeals to "reasonable standards" rather than a single canonical test.
- It invokes "Turing’s own" framing, suggesting that if the system’s behavior is broadly indistinguishable from intelligent behavior across conversation and problem solving, then we should treat it as intelligent.
That matters because the "AGI" label is not just semantics. It affects:
- Public expectations (utopia or doom timelines)
- Policy urgency (export controls, evaluation mandates, liability)
- Corporate strategy (what gets built, bought, or replaced)
- Safety posture (what risks are assumed to be possible now)
What do we mean by AGI, practically?
A lot of AGI debates stall because people use different definitions. In practice, most definitions combine a few ideas:
- Breadth: capability across many domains, not a single narrow task
- Transfer: ability to apply knowledge to new problems with limited instruction
- Adaptation: learning or updating strategies when circumstances change
- Autonomy: setting sub-goals and executing multi-step plans
- Robustness: not falling apart under small changes, ambiguity, or adversarial prompts
LLMs have made big jumps on breadth and transfer. They can write, summarize, code, tutor, plan trips, draft contracts, explain medical concepts (with caveats), and role-play as specialists. This is why the Nature comment, as Mollick notes, lands as "bold" but not obviously absurd.
The sticking points are usually robustness and autonomy, plus an argument that "general" should include grounded interaction with the physical world. But even those lines are getting blurry as models connect to tools, memory, and sensors.
The Turing Test angle: what would Turing count as success?
Invoking Turing is clever because it redirects the conversation from "does it think" to "what can it do in interaction." Turing’s imitation game was not a consciousness test. It was a behavioral test: can a machine carry a conversation well enough that a judge cannot reliably tell it from a human?
In 2026, many people have had the experience of chatting with an LLM and thinking, "If I did not know, I might believe this was a person." That does not settle AGI, but it does validate the core point: behaviorally, something important has changed.
At the same time, Turing’s framing has a loophole that modern systems exploit: fluent language can mask weak understanding. The model may sound like it "gets it" while hallucinating, miscounting, or failing on edge cases.
So the question becomes: do we judge intelligence by typical performance, or by worst-case brittleness?
Capability is not reliability (and this is where the debate gets real)
Here is the most useful way I have found to hold both truths at once:
Today’s frontier models can be astonishingly capable and still not be dependable.
In many real settings, a system that is right 85 percent of the time is transformative if you can verify, constrain, and iterate. But it is dangerous if you treat it like a fully trusted agent.
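A rough back-of-the-envelope sketch shows why that 85 percent figure cuts both ways (the number is illustrative, not a measured model accuracy): if each step of a multi-step task succeeds independently, per-step reliability compounds, and end-to-end success collapses fast.

```python
# Illustrative only: if each step of a task succeeds independently
# with probability p, the whole n-step chain succeeds with p**n.
# The 0.85 figure is a hypothetical, not a benchmark result.

def chain_success(p: float, n: int) -> float:
    """Probability that n independent steps all succeed."""
    return p ** n

for n in (1, 5, 10, 20):
    print(f"{n:2d} steps at 85% each -> {chain_success(0.85, n):.1%} end-to-end")
```

Ten steps at 85 percent each leaves roughly a one-in-five chance of a clean run, which is why verification and iteration, not raw capability, decide whether the system is transformative or dangerous.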
Examples where LLMs look "general":
- Writing and revising across genres and audiences
- Explaining concepts at different levels (child, student, expert)
- Translating intent into code and debugging with feedback
- Combining scattered information into a coherent plan
Examples where LLMs still fail in ways humans find unintelligent:
- Confidently inventing citations or details under pressure
- Losing track of constraints in long multi-step tasks
- Being highly sensitive to wording, context, or hidden assumptions
- Struggling with novel situations that require real-time grounding
So if someone declares "AGI is solved," you should immediately ask: solved for which definition, under what error tolerance, with what safeguards, and in what environments?
If we accept the claim, what changes tomorrow?
Mollick’s post implicitly invites a second-order question: even if you buy the Nature argument, do we do anything differently?
I think we do, but not in the cinematic way.
1) Evaluation shifts from benchmarks to audits
Benchmarks are useful, but they are not enough. If systems are "generally intelligent" in the sense of broad competence, evaluation has to look like:
- Red-team testing by domain experts
- Long-horizon task trials (hours or days, not minutes)
- Tool-use and agentic behavior assessments
- Incident reporting and postmortems, like security engineering
2) Workplace adoption becomes a management problem, not a model problem
If you treat LLMs as generally capable collaborators, the bottleneck becomes:
- Training people to delegate well
- Building verification into workflows
- Defining what cannot be automated (accountability, approvals, risk)
- Measuring productivity honestly (time saved versus errors introduced)
In other words, the "AGI" moment might look less like a lab announcement and more like a slow restructuring of jobs around human-plus-AI teams.
3) Policy needs sharper categories than "AGI"
Policymakers like labels, but "AGI" is a blunt instrument. Regulation may need to focus on specific risk profiles:
- Systems that can autonomously execute transactions
- Systems used in hiring, lending, healthcare, or legal decisions
- Systems that can generate persuasive content at scale
- Systems that can discover vulnerabilities or design harmful agents
Calling something AGI does not tell you where the harm is. Capabilities plus deployment context do.
If we reject the claim, what are we protecting?
It is also worth asking why some researchers resist the AGI framing.
Sometimes it is because "general" is taken to mean human-level across essentially everything, including common sense, physical reasoning, and stable long-term autonomy.
Sometimes it is a strategic concern: declaring AGI achieved can trigger hype cycles, policy overreactions, or risky deployment races.
And sometimes it is simply scientific caution: language behavior is a limited window into cognition, and we should not confuse fluency with understanding.
All of those are reasonable. But the reason Mollick’s highlight spreads is that many people feel the ground moving. The old mental model of "narrow AI" does not fit anymore.
A more actionable way to phrase the question
Instead of asking "Is this AGI?" I prefer asking:
- What tasks can the system generalize to with minimal instruction?
- How does performance degrade under ambiguity, novelty, and adversarial pressure?
- What happens when the system is given tools, memory, and permission to act?
- Can humans reliably supervise it at the speed it operates?
If the answers make you uneasy, then regardless of the label, you are living in the world Mollick is pointing at.
Where I land after reading Mollick’s prompt
I read Mollick’s post less as "AGI is here, full stop" and more as an invitation to update our seriousness. If credible scholars are willing to argue in Nature that "by reasonable standards" AGI is achieved, then the burden shifts onto the rest of us to specify our standards, test them, and design institutions around what these systems can already do.
Whether you call it AGI, general-purpose AI, or simply "LLMs plus tools," the practical takeaway is the same: treat capability as real, treat reliability as conditional, and build the human systems that keep both in balance.
This blog post expands on a viral LinkedIn post by Ethan Mollick, Associate Professor at The Wharton School and author of Co-Intelligence.