
Bjarn Brunenberg Reframes AI for Experimentation Teams

AI for Experimentation

A practical expansion of Bjarn Brunenberg's AI maturity model for experimentation teams, with examples, workflows, and role design tips.

Tags: LinkedIn content, viral posts, content strategy, AI experimentation, CRO, growth experimentation, workflow automation, AI maturity model, social media marketing

Bjarn Brunenberg recently shared something that caught my attention: "Traditional experimentation as we know it is a shrinking skillset... The discipline isn't dying. But the 'manual' version of it is." He added that the experimenter role is shifting from "doer" to "operator".

That framing matters, because it cuts through a lot of AI noise. The question is not whether experimentation teams will keep running tests. The question is who (or what) does the repetitive work, and what the human role becomes when AI can draft, cluster, summarize, and generate faster than any individual contributor.

In this post, I want to expand on Bjarn's "4 levels of AI maturity for experimentation teams" and make it actionable. Think of it as a conversation with his model: what each level looks like in real workflows, what to watch out for, and how to move up a level without breaking quality, ethics, or trust.

"UPSKILLING IS NOT ABOUT PROMPTING; IT IS ABOUT ROLE REDESIGN" - Bjarn Brunenberg

Why "manual experimentation" is shrinking (but not experimentation)

Experimentation is a discipline: defining good hypotheses, designing valid tests, interpreting results, and turning learnings into decisions. That is not going away.

What is shrinking is the set of tasks we historically used to equate with being "good" at experimentation: writing every variant by hand, manually combing through interview notes, assembling QA checklists from scratch, building tracking plans in a doc, and spending hours on analysis writeups that follow the same structure every time.

AI is getting good at these repeatable patterns. So the value shifts upward:

  • From producing artifacts to producing decision quality
  • From individual output to system design
  • From speed at tasks to clarity in constraints and strategy

If you keep judging your team by how much they personally "do," you will underuse AI and overpay for manual labor. If you judge them by the quality of decisions and the strength of the experimentation system, the future looks a lot brighter.

The 4 levels of AI maturity for experimentation teams

Bjarn's four levels map nicely to how teams actually adopt AI. The biggest trap is thinking Level 4 is "more prompts." It's not. It's more operational design.

Level 1: AI-Aware (the "Side Project")

At Level 1, AI use is casual and disconnected from the core workflow. Someone occasionally asks a tool to rewrite a headline, brainstorm a few variations, or summarize a transcript.

This is useful, but it rarely changes outcomes because:

  • It is not repeatable (depends on who remembers to use it)
  • It is not connected to your research repository, analytics, or testing backlog
  • It does not reduce cycle time in a measurable way

If you are at Level 1, a simple next step is to standardize two or three "approved" use cases. For example:

  • Draft 10 copy variants from a defined messaging framework
  • Summarize one interview into a fixed template (pain, job-to-be-done, objections, triggers)
  • Generate a first-pass hypothesis statement from structured inputs

The goal is not to use AI more. It is to reduce randomness.
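
To make one of these concrete, here is a minimal sketch of the third use case: a shared prompt template that turns structured inputs into a first-pass hypothesis. The field names and prompt wording are my own illustration, not a standard; the point is that everyone on the team asks the model the same way.

```python
# A shared Level 1 prompt template: structured inputs in, one consistent
# first-pass hypothesis request out. Field names are illustrative.

HYPOTHESIS_PROMPT = """\
You are drafting an experiment hypothesis.
Goal: {goal}
Audience: {audience}
Observed friction: {friction}
Supporting evidence: {evidence}

Write exactly one hypothesis in the form:
"If we <change>, then <audience> will <outcome>, because <evidence>."
"""

def build_hypothesis_prompt(goal: str, audience: str, friction: str, evidence: str) -> str:
    """Fill the shared template so every teammate queries the model the same way."""
    return HYPOTHESIS_PROMPT.format(
        goal=goal, audience=audience, friction=friction, evidence=evidence
    )

print(build_hypothesis_prompt(
    goal="increase trial signups",
    audience="first-time mobile visitors",
    friction="drop-off on the pricing page",
    evidence="session recordings show hesitation at the plan comparison table",
))
```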

Level 2: AI-Enabled (the "Task Driver")

At Level 2, teams start using AI to automate boring tasks: summarizing batches of interviews, generating basic CSS for test variants, drafting an initial QA checklist, or producing an analysis narrative.

This typically creates speed, but not leverage. As Bjarn put it, you are "slightly faster" but still making every decision.

Two practical additions at this stage:

  1. Create templates that AI fills: Instead of "write me a tracking plan," use a structured prompt or form that outputs events, properties, and validation steps in your chosen analytics format (see the sketch after this list).
  2. Add review gates: Decide where human review is mandatory (brand voice, legal, sensitive segments, statistical interpretation).
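
Here is a minimal sketch of the first addition, assuming a plain Python schema rather than any specific analytics format. The idea is to ask the model to return output that fits this structure, which makes the output checkable instead of merely readable.

```python
# A sketch of a "template the AI fills": the model is asked to return a
# tracking plan matching this structure. The schema is illustrative, not
# tied to any particular analytics vendor.

from dataclasses import dataclass, field

@dataclass
class TrackingEvent:
    name: str                    # e.g. "checkout_step_viewed"
    trigger: str                 # when the event should fire
    properties: dict[str, str]   # property name -> type ("string", "int", ...)
    validation: list[str] = field(default_factory=list)  # QA steps before launch

@dataclass
class TrackingPlan:
    experiment_id: str
    events: list[TrackingEvent]

    def unvalidated_events(self) -> list[str]:
        """Review gate: any event with no validation steps blocks the launch."""
        return [e.name for e in self.events if not e.validation]
```

A review gate can then be as mechanical as refusing to ship any plan where unvalidated_events() is non-empty.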

Level 2 teams often see productivity gains, but they also risk a new failure mode: shipping more experiments that are poorly grounded. Speed without quality control just increases the rate of mistakes.

Level 3: AI-Enhanced (the "Human-in-the-loop")

This is where Bjarn's model gets especially interesting: AI is embedded into the experimentation workflow itself.

He listed examples like:

  • Intake forms for hypotheses
  • Auto-clustering user feedback into themes
  • Drafting experiment plans, QA checklists, and tracking specs
  • Creating variant copy aligned to a messaging framework

I think of Level 3 as "workflow-native AI." Instead of asking a chatbot for help, your system produces drafts and suggestions at the right moment in the process.

Here is what Level 3 can look like in practice:

A Level 3 hypothesis intake flow

  1. PM, marketer, or CRO specialist fills a short intake form (goal, audience, page, friction, evidence, constraints)
  2. AI produces:
    • A hypothesis in your preferred structure (If-then-because)
    • A confidence score based on evidence completeness (not truth)
    • Suggested metrics and risks
  3. The experimenter reviews and edits, then routes to prioritization
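
As a minimal sketch of step 2, assuming hypothetical form fields: the confidence score can literally be evidence completeness (how much of the intake is filled in, not whether it is true), and the If-then-because draft is assembled for a human to review, never auto-approved.

```python
# Sketch of the intake step: score evidence completeness and assemble an
# If-then-because draft for human review. Fields and weights are assumptions.

EVIDENCE_FIELDS = ["analytics", "user_research", "past_experiments", "heuristics"]

def evidence_completeness(intake: dict) -> float:
    """Fraction of evidence fields actually filled in -- completeness, not truth."""
    provided = sum(1 for f in EVIDENCE_FIELDS if intake.get(f))
    return provided / len(EVIDENCE_FIELDS)

def draft_hypothesis(intake: dict) -> dict:
    """Assemble an If-then-because draft plus a score for the reviewer."""
    hypothesis = (
        f"If we {intake['change']}, "
        f"then {intake['audience']} will {intake['expected_outcome']}, "
        f"because {intake['rationale']}."
    )
    return {
        "hypothesis": hypothesis,
        "confidence": round(evidence_completeness(intake), 2),
        "status": "needs_review",  # a human always routes it to prioritization
    }

print(draft_hypothesis({
    "change": "show shipping costs on the product page",
    "audience": "first-time buyers",
    "expected_outcome": "abandon checkout less often",
    "rationale": "exit surveys repeatedly cite surprise shipping fees",
    "analytics": "checkout funnel drop-off at the shipping step",
    "user_research": "multiple exit-survey mentions",
}))
```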

A Level 3 research synthesis loop

  1. User feedback (surveys, reviews, call transcripts) flows into a repository
  2. AI clusters into themes (pricing confusion, trust objections, feature discoverability)
  3. The experimenter validates clusters, labels them, and ties them to opportunities
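
Here is a minimal stand-in for step 2 of this loop, using TF-IDF and k-means from scikit-learn. A production setup would more likely cluster on text embeddings, but the human-in-the-loop shape is identical: the machine proposes clusters, the experimenter names and validates them.

```python
# A crude auto-clustering sketch: vectorize feedback, cluster it, and hand
# the unlabeled clusters to a human. The feedback snippets are invented.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

feedback = [
    "I can't tell which pricing plan includes API access",
    "The pricing tiers are confusing, what does Pro actually add?",
    "I don't trust this checkout, there is no security badge anywhere",
    "Hard to trust a store with no reviews or guarantees",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(feedback)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# The experimenter validates and names the clusters ("pricing confusion",
# "trust objections") before they are tied to opportunities.
for label, text in sorted(zip(labels, feedback)):
    print(label, "|", text)
```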

This is where the experimenter becomes an operator: you direct strategy and validate context the machine misses, like brand nuance, ethics, and politics. AI can propose. Humans dispose.

A good Level 3 principle: if AI drafts it, the team must own it. Accountability stays human.

Level 4: AI-Driven (the "Supervisor")

Level 4 is not "hands off." It is "system on." Bjarn describes treating AI like an engine your team controls, with:

  • A living knowledge base of past tests, learnings, segments, objections
  • Guardrails (what we do not test, brand voice, legal constraints)
  • Measurement and learning loops (what we learned, what we do next)

At Level 4, you stop thinking in single experiments and start thinking in a decision factory.

What a Level 4 experimentation engine includes

  • Knowledge base: A searchable store of experiments, segments, hypotheses, outcomes, screenshots, and key learnings (including null results)
  • Guardrails: Clear constraints like "no fear-based health claims," "no dark patterns," "no unsubstantiated urgency," plus brand tone rules and regulatory requirements
  • A learning loop: Every test produces structured learnings that feed back into the system, influencing future recommendations and prioritization
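
Here is a sketch of two of these building blocks, with illustrative fields and an intentionally crude guardrail check. A real policy layer would be richer, but the shape is the same: structured learnings in, explicit constraints enforced before anything reaches review.

```python
# Two Level 4 building blocks: a structured learning record that feeds the
# knowledge base, and a banned-phrase check run on AI-drafted copy. The
# record fields and phrase list are illustrative, not a complete policy.

from dataclasses import dataclass

@dataclass
class LearningRecord:
    experiment_id: str
    hypothesis: str
    segment: str
    outcome: str       # "win", "loss", or "null" -- null results are kept too
    key_learning: str  # one sentence the next experiment can build on

BANNED_PHRASES = ["only 2 left", "doctors hate", "act now or lose"]

def violates_guardrails(copy_text: str) -> list[str]:
    """Return the banned phrases found; an empty list means it can go to review."""
    lowered = copy_text.lower()
    return [p for p in BANNED_PHRASES if p in lowered]
```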

Why the "Supervisor" framing is right

Supervision is about inputs and controls:

  • Are we feeding the engine high-quality evidence?
  • Are the constraints explicit and enforceable?
  • Are we measuring what matters and learning in a consistent format?

At this level, the experimenter's craft shifts toward orchestration: designing the system, auditing outputs, and ensuring the org uses learnings to make better decisions.

Upskilling that actually matches the shift (beyond prompting)

Bjarn's line about upskilling not being about prompting is the part I want more teams to internalize. Prompting is a tactic. Role redesign is the strategy.

Here are concrete skill upgrades that map to the operator and orchestrator future:

  • Systems thinking: defining repeatable workflows, templates, and feedback loops
  • Information architecture: structuring research and experiment data so AI and humans can retrieve it reliably
  • Decision hygiene: separating evidence quality from opinion, and making assumptions explicit
  • Risk and ethics: defining what not to test, plus escalation paths for sensitive outputs
  • Measurement discipline: ensuring tracking specs, validation, and analysis standards are consistent

If you want one simple starting point, it is this: treat every recurring experimentation artifact (hypothesis, plan, QA, tracking, analysis, learnings) as a product that can be systematized.
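
A minimal sketch of what "artifact as a product" can mean in practice, with hypothetical template names: each artifact type gets a versioned spec, and any draft, human- or AI-written, is checked against it before it moves on.

```python
# A registry of versioned artifact templates, so every hypothesis, QA
# checklist, or learning writeup starts from the same systematized base.
# Template names and required fields are illustrative assumptions.

ARTIFACT_TEMPLATES = {
    "hypothesis":   {"version": 2, "fields": ["change", "audience", "outcome", "evidence"]},
    "qa_checklist": {"version": 1, "fields": ["browsers", "tracking_fired", "flicker_check"]},
    "learning":     {"version": 3, "fields": ["experiment_id", "outcome", "key_learning"]},
}

def missing_fields(artifact_type: str, draft: dict) -> list[str]:
    """Which required fields has a draft left empty?"""
    spec = ARTIFACT_TEMPLATES[artifact_type]
    return [f for f in spec["fields"] if not draft.get(f)]
```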

A practical roadmap to move up one level

If your team is trying to progress on Bjarn's maturity ladder, focus on one level-up, not a leap to Level 4.

  • From Level 1 to Level 2: standardize two or three AI-assisted tasks and measure time saved
  • From Level 2 to Level 3: embed AI into the workflow via forms, repositories, and templates with mandatory review gates
  • From Level 3 to Level 4: build the knowledge base and guardrails, then enforce a structured learning loop

The north star is not "more AI." The north star is better, faster decisions with fewer blind spots.

Closing thought

Bjarn Brunenberg's point that experimentation is not dying, only the manual version, is both a warning and an opportunity. If you keep defining the experimenter as the person who manually produces everything, the role will feel squeezed. If you redefine the role as an operator and orchestrator of an experimentation engine, your impact can expand.

This blog post expands on a viral LinkedIn post by Bjarn Brunenberg. View the original LinkedIn post →