🧑🚀 Luke Chadwick on Agents That Quit After Hours
A response to 🧑🚀 Luke Chadwick on why AI agents stall after a few hours and how to run longer open-source agent workflows.
🧑🚀 Luke Chadwick recently shared something that caught my attention: "More and more stories of agents being given complex whole projects and left to run. Most of the available tooling though will work for a few hours at most... Is anyone out running bigger/longer projects on open-source tooling?" That question lands because it describes the exact gap many teams feel right now: we can demo autonomy, but sustaining it across a real project timeline is a different sport.
"Most of the available tooling... will work for a few hours at most."
In this post, I want to expand on Luke's point and make it practical: why agent runs degrade over time, what changes when the unit of work becomes a whole project, and what an open-source stack looks like if you actually want multi-day reliability.
The problem Luke is pointing at: autonomy is easy, endurance is hard
Lots of agent tooling looks great in a 10-minute clip: plan, call tools, write code, ship a patch. The trouble starts when you give an agent a whole project and walk away. Over hours, the system accumulates small failures that would be harmless in a short run but become fatal at day scale.
Luke's "Ralph Wiggum loop" reference is perfect: once an agent loses its place, it can get stuck repeating the same half-broken attempt, confidently burning tokens and time without making progress.
Why "a few hours" is the common ceiling
In my experience, long-running agent projects fail for a handful of predictable reasons:
- Context drift and goal dilution. Even with summaries, the agent gradually loses the crisp definition of "done" and starts optimizing for activity instead of outcomes.
- Non-idempotent tool use. An agent reruns a command, re-applies a patch, re-opens a PR, or re-scrapes data. Without idempotency and state checks, retries multiply damage.
- Weak state and memory design. If your only memory is a chat transcript, you are one truncation away from amnesia. If your memory is a vector store without strong structure, you get fuzzy recall at exactly the wrong times.
- Error handling that is fine for demos, not operations. Transient network errors, rate limits, flaky tests, and tool timeouts are normal. "Try again" is not a strategy unless you add backoff, budgets, and escalating recovery paths.
- No observability, no control. If you cannot answer "what is it doing right now and why?", you cannot intervene before the loop eats the day.
Key insight: long-running agents are not primarily a prompting problem. They are a systems problem.
What changes when the agent owns a whole project
When you assign an agent a complete project, you are implicitly asking it to do project management. That includes:
- Decomposing work into a backlog
- Tracking dependencies and partial completion
- Handling interruptions and resuming cleanly
- Proving progress with artifacts (commits, tests, docs)
- Knowing when to ask for help
The agent needs a durable notion of state that lives outside the model. Think "workflow engine" more than "chat loop".
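As a sketch of what that durable state might contain, here is a minimal backlog model in Python. The names (`Task`, `ProjectState`, `ready_tasks`) are illustrative, not from any particular framework; the point is that dependencies and partial completion live in data, not in the prompt.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    BLOCKED = "blocked"
    DONE = "done"

@dataclass
class Task:
    id: str
    goal: str
    acceptance: str                  # how "done" is proven (test command, artifact path)
    status: Status = Status.QUEUED
    depends_on: list[str] = field(default_factory=list)

@dataclass
class ProjectState:
    objective: str
    backlog: list[Task]

    def ready_tasks(self) -> list[Task]:
        """Tasks whose dependencies are all done: the orchestrator's work queue."""
        done = {t.id for t in self.backlog if t.status is Status.DONE}
        return [t for t in self.backlog
                if t.status is Status.QUEUED and set(t.depends_on) <= done]
```

Because this lives outside the model, the agent can crash, restart, and pick up exactly where the backlog says it left off.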
A practical open-source approach to longer runs
Luke specifically asked about open-source tooling, so here is a pattern that I have seen hold up better than the usual single-loop agent.
1) Put a workflow engine in charge
Instead of letting the LLM drive everything, put a deterministic orchestrator in charge and let the model propose actions.
Open-source options to consider:
- Temporal (durable workflows, retries, timeouts, long-running activities)
- Prefect or Dagster (dataflow-style orchestration, scheduling, retries)
- Apache Airflow (batch workflows, less interactive but proven)
Why this matters: if the agent crashes, the workflow resumes. If a step fails, you can retry safely with guardrails.
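To make the resumability idea concrete, here is a framework-free sketch of what an engine like Temporal gives you for free: completed step IDs are persisted, so a restarted run skips work that already succeeded instead of redoing it. The file name and function names are mine, for illustration only.

```python
# Minimal sketch of durable, resumable execution. A restarted run
# consults the checkpoint and skips steps that already completed.
import json
import pathlib

CHECKPOINT = pathlib.Path("checkpoint.json")

def load_done() -> set[str]:
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def run_workflow(steps):
    """steps: list of (step_id, callable). Re-entrant across crashes."""
    done = load_done()
    for step_id, fn in steps:
        if step_id in done:
            continue                 # already completed in a previous run
        fn()
        done.add(step_id)
        CHECKPOINT.write_text(json.dumps(sorted(done)))  # persist progress
```

A real orchestrator adds retries, timeouts, and distributed workers on top, but the core contract is the same: progress is recorded outside the process that makes it.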
2) Make state explicit and testable
Store structured state in a database, not just in prompts.
At minimum, keep:
- Current objective and acceptance criteria
- Task graph (queued, running, blocked, done)
- Tool outputs that matter (paths, URLs, hashes, command logs)
- Decisions and rationale (short, structured)
Postgres or SQLite is often enough. The model should read from this state and write back to it in a schema you can validate.
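Here is one possible shape for that state, sketched as a SQLite schema (the table and column names are illustrative; adapt them to your own task graph):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # Postgres in production; SQLite is enough to start
conn.executescript("""
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    goal TEXT NOT NULL,
    acceptance TEXT NOT NULL,        -- how "done" is proven
    status TEXT NOT NULL DEFAULT 'queued'
        CHECK (status IN ('queued', 'running', 'blocked', 'done'))
);
CREATE TABLE decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT REFERENCES tasks(id),
    rationale TEXT NOT NULL,         -- short, structured reasoning
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# The model reads from and writes back to this schema, which you can validate.
conn.execute("INSERT INTO tasks (id, goal, acceptance) VALUES (?, ?, ?)",
             ("t1", "add login tests", "pytest tests/test_login.py passes"))
conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", ("t1",))
```

The `CHECK` constraint is doing real work here: a model that hallucinates a status like "almost-done" gets rejected by the database, not silently accepted.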
3) Build a loop that cannot spin forever
A "Ralph Wiggum loop" is usually a missing circuit breaker. Add explicit budgets and stop conditions:
- Max retries per step, with exponential backoff
- Max token budget per task
- Max wall-clock time per task
- A rule like: after N failures, escalate to human review with a compact incident report
The incident report should include: last successful checkpoint, failing command, error snippet, and the agent's proposed next action.
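A circuit breaker like this is a few dozen lines. The sketch below combines a retry budget, a wall-clock budget, exponential backoff, and a compact incident report raised for human review; the names are illustrative.

```python
import time

class StepBudgetExceeded(Exception):
    """Raised so the orchestrator can escalate to a human with context."""

def run_with_budget(step, max_retries=3, max_seconds=60.0, base_delay=1.0):
    """Run one step with bounded retries, backoff, and a time budget."""
    start = time.monotonic()
    errors = []
    for attempt in range(max_retries):
        if time.monotonic() - start > max_seconds:
            break                                  # wall-clock budget exhausted
        try:
            return step()
        except Exception as exc:
            errors.append(str(exc))
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Compact incident report instead of an infinite "Ralph Wiggum loop"
    raise StepBudgetExceeded({
        "attempts": len(errors),
        "recent_errors": errors[-3:],
        "elapsed_s": round(time.monotonic() - start, 1),
    })
```

In a real system you would attach the last successful checkpoint and the agent's proposed next action to the report before paging a human.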
4) Separate planning from execution
Many loops fail because planning and execution happen in the same breath. A more stable pattern:
- Planner produces a short plan and writes it to state
- Executor runs one step only
- Critic verifies the result against acceptance criteria
- Orchestrator decides: continue, retry, re-plan, or escalate
You can implement this with open-source agent frameworks (for example LangGraph-style state machines), but the key is the separation of concerns.
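As a minimal sketch of that separation, here the planner, executor, and critic are stand-ins for model calls, while the control flow itself is deterministic code with an explicit re-plan budget (all names are mine, not from LangGraph or any specific framework):

```python
def run_cycle(planner, executor, critic, state, max_replans=2):
    """One plan -> act -> verify loop. The orchestrator, not the model,
    decides whether to continue, re-plan, or escalate."""
    for _ in range(max_replans + 1):
        plan = planner(state)              # planner writes a short plan to state
        state["plan"] = plan
        for step in plan:
            result = executor(step)        # executor runs one step only
            if critic(step, result):       # critic checks acceptance criteria
                state.setdefault("done", []).append(step)
            else:
                break                      # stop executing; re-plan on next pass
        else:
            return "complete"              # every step verified
    return "escalate"                      # re-plan budget exhausted
```

The value of the split is that the critic can say "this did not work" even when the executor's model is confident, and the orchestrator's decision logic is testable without any model in the loop.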
5) Treat tools like production integrations
Long projects touch real systems: git, CI, package registries, cloud APIs. Make tool calls observable and replayable.
Helpful open-source pieces:
- Langfuse for tracing and prompt/tool observability
- OpenTelemetry for unified logs, traces, and metrics
- A simple "tool gateway" service that logs inputs/outputs and enforces policies
Policies matter. Example: the agent can run tests and create branches, but merging to main requires approval.
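A tool gateway that enforces exactly that policy can be sketched in a few lines. Everything here (the policy table, the audit log, the function names) is illustrative; in production the gateway would be a service in front of your real integrations.

```python
import time

# Policy table: which tool calls are allowed, and under what conditions.
POLICIES = {
    "run_tests": lambda args: True,                      # always allowed
    "create_branch": lambda args: True,
    "merge": lambda args: args.get("approved") is True,  # humans gate main
}

AUDIT_LOG = []

def call_tool(name, args, tool_impls):
    """Every call is policy-checked and logged with inputs and outputs."""
    allowed = POLICIES.get(name, lambda a: False)(args)  # deny unknown tools
    entry = {"tool": name, "args": args, "allowed": allowed, "ts": time.time()}
    if allowed:
        entry["output"] = tool_impls[name](**args)
    AUDIT_LOG.append(entry)                              # replayable record
    if not allowed:
        raise PermissionError(f"policy denied tool call: {name}")
    return entry["output"]
```

Because every input and output lands in the audit log, you can replay a run step by step when something goes wrong, which is exactly the observability the previous section argued for.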
6) Use sandboxes and artifacts to prove progress
For coding agents, the best antidote to drift is an artifact pipeline:
- Every change is a commit
- Every claim is backed by a test result or a reproducible command
- The agent must keep a "runbook" file updated: how to build, test, and reproduce
Run the agent in a sandbox (Docker is usually sufficient) so you can restart without contaminating environments.
What "bigger/longer" looks like in practice
If I were answering Luke's question directly, I would define "longer" as:
- The system can run unattended overnight
- It can recover from transient failures
- It can resume after a restart
- It produces measurable progress checkpoints
- It stops safely when uncertain
That is less about making the model smarter, and more about making the surrounding system resilient.
Here is a simple blueprint for a long-running open-source agent setup:
- Orchestrator: Temporal (or Prefect) owns retries, schedules, and timeouts
- Agent logic: a state machine (graph) that runs plan -> act -> verify cycles
- State: Postgres for tasks, decisions, and checkpoints
- Workspace: Docker sandbox per job, with mounted repo
- Observability: Langfuse + OpenTelemetry logs
- Safety: budgets, circuit breakers, and human escalation
A checklist to avoid the "few hours" cliff
If your agents currently stall after a couple of hours, these changes usually move the needle quickly:
- Define "done" as tests passing, a PR created, or a report generated, not "looks good"
- Make every step idempotent (check before you change)
- Save checkpoints after each successful step
- Limit retries and force escalation after repeated failures
- Add a verifier that can say "this did not work" even when the model is confident
- Log every tool call and capture outputs
- Periodically re-anchor the agent by reloading the objective and acceptance criteria from state
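The "check before you change" item deserves a concrete example, because it is the cheapest fix on the list. A hedged sketch (the helper name is mine): an idempotent write that reports whether any work actually happened, so a retried step is harmless.

```python
import hashlib
import pathlib

def write_if_changed(path: str, content: str) -> bool:
    """Idempotent step: check before you change; return whether work happened."""
    p = pathlib.Path(path)
    digest = hashlib.sha256(content.encode()).hexdigest()
    if p.exists() and hashlib.sha256(p.read_bytes()).hexdigest() == digest:
        return False                 # already done; a retry is a no-op
    p.write_text(content)
    return True
```

The same pattern applies to commits, PRs, and API calls: query current state first, act only on the delta, and return a signal the orchestrator can checkpoint.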
If you can replay an agent run step-by-step, you can debug it. If you cannot replay it, you will eventually fear it.
Closing the loop with Luke's question
Luke asked: "Is anyone out running bigger/longer projects on open-source tooling?" I think the honest answer is: yes, but not by relying on a single agent loop. The teams getting endurance are borrowing patterns from distributed systems and workflow engineering: durable state, explicit orchestration, tight verification, and strong stop conditions.
If you are experimenting here, I would love to see more public write-ups that go beyond "it worked once" and instead report: runtime, failure modes, recovery strategy, and how progress was measured. That is the kind of detail that will turn agent hype into agent operations.
This blog post expands on a viral LinkedIn post by 🧑🚀 Luke Chadwick.