🧑🚀 Luke Chadwick on Agents That Quit After Hours
A response to 🧑🚀 Luke Chadwick on why AI agents stall after a few hours and how to run longer open-source agent workflows.
🧑🚀 Luke Chadwick recently shared something that caught my attention: "More and more stories of agents being given complex whole projects and left to run. Most of the available tooling though will work for a few hours at most... Is anyone out running bigger/longer projects on open-source tooling?" That question lands because it describes the exact gap many teams feel right now: we can demo autonomy, but sustaining it across a real project timeline is a different sport.
"Most of the available tooling... will work for a few hours at most."
In this post, I want to expand on Luke's point and make it practical: why agent runs degrade over time, what changes when the unit of work becomes a whole project, and what an open-source stack looks like if you actually want multi-day reliability.
The problem Luke is pointing at: autonomy is easy, endurance is hard
Lots of agent tooling looks great in a 10-minute clip: plan, call tools, write code, ship a patch. The trouble starts when you give an agent a whole project and walk away. Over hours, the system accumulates small failures that would be harmless in a short run but become fatal at day scale.
Luke's "Ralph Wiggum loop" reference is perfect: once an agent loses its place, it can get stuck repeating the same half-broken attempt, confidently burning tokens and time without making progress.
Why "a few hours" is the common ceiling
In my experience, long-running agent projects fail for a handful of predictable reasons:
- Context drift and goal dilution. Even with summaries, the agent gradually loses the crisp definition of "done" and starts optimizing for activity instead of outcomes.
- Non-idempotent tool use. An agent reruns a command, re-applies a patch, re-opens a PR, or re-scrapes data. Without idempotency and state checks, retries multiply damage.
- Weak state and memory design. If your only memory is a chat transcript, you are one truncation away from amnesia. If your memory is a vector store without strong structure, you get fuzzy recall at exactly the wrong times.
- Error handling that is fine for demos, not operations. Transient network errors, rate limits, flaky tests, and tool timeouts are normal. "Try again" is not a strategy unless you add backoff, budgets, and escalating recovery paths.
- No observability, no control. If you cannot answer "what is it doing right now and why?", you cannot intervene before the loop eats the day.
Key insight: long-running agents are not primarily a prompting problem. They are a systems problem.
What changes when the agent owns a whole project
When you assign an agent a complete project, you are implicitly asking it to do project management. That includes:
- Decomposing work into a backlog
- Tracking dependencies and partial completion
- Handling interruptions and resuming cleanly
- Proving progress with artifacts (commits, tests, docs)
- Knowing when to ask for help
The agent needs a durable notion of state that lives outside the model. Think "workflow engine" more than "chat loop".
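As a sketch of what that durable state might contain, here is a minimal backlog model in Python. The names (`Task`, `ProjectState`, `ready_tasks`) are illustrative, not from any particular framework; the point is that dependencies and partial completion live in data, not in the prompt.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    BLOCKED = "blocked"
    DONE = "done"

@dataclass
class Task:
    id: str
    goal: str
    acceptance: str                  # how "done" is proven (test command, artifact path)
    status: Status = Status.QUEUED
    depends_on: list[str] = field(default_factory=list)

@dataclass
class ProjectState:
    objective: str
    backlog: list[Task]

    def ready_tasks(self) -> list[Task]:
        """Tasks whose dependencies are all done: the orchestrator's work queue."""
        done = {t.id for t in self.backlog if t.status is Status.DONE}
        return [t for t in self.backlog
                if t.status is Status.QUEUED and set(t.depends_on) <= done]
```

Because this lives outside the model, the agent can crash, restart, and pick up exactly where the backlog says it left off.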
A practical open-source approach to longer runs
Luke specifically asked about open-source tooling, so here is a pattern that I have seen hold up better than the usual single-loop agent.
1) Put a workflow engine in charge
Instead of letting the LLM drive everything, put a deterministic orchestrator in charge and let the model propose actions.
Open-source options to consider:
- Temporal (durable workflows, retries, timeouts, long-running activities)
- Prefect or Dagster (dataflow-style orchestration, scheduling, retries)
- Apache Airflow (batch workflows, less interactive but proven)
Why this matters: if the agent crashes, the workflow resumes. If a step fails, you can retry safely with guardrails.
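To make the resumability idea concrete, here is a framework-free sketch of what an engine like Temporal gives you for free: completed step IDs are persisted, so a restarted run skips work that already succeeded instead of redoing it. The file name and function names are mine, for illustration only.

```python
# Minimal sketch of durable, resumable execution. A restarted run
# consults the checkpoint and skips steps that already completed.
import json
import pathlib

CHECKPOINT = pathlib.Path("checkpoint.json")

def load_done() -> set[str]:
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def run_workflow(steps):
    """steps: list of (step_id, callable). Re-entrant across crashes."""
    done = load_done()
    for step_id, fn in steps:
        if step_id in done:
            continue                 # already completed in a previous run
        fn()
        done.add(step_id)
        CHECKPOINT.write_text(json.dumps(sorted(done)))  # persist progress
```

A real orchestrator adds retries, timeouts, and distributed workers on top, but the core contract is the same: progress is recorded outside the process that makes it.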
2) Make state explicit and testable
Store structured state in a database, not just in prompts.
At minimum, keep:
- Current objective and acceptance criteria
- Task graph (queued, running, blocked, done)
- Tool outputs that matter (paths, URLs, hashes, command logs)
- Decisions and rationale (short, structured)
Postgres or SQLite is often enough. The model should read from this state and write back to it in a schema you can validate.
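Here is one possible shape for that state, sketched as a SQLite schema (the table and column names are illustrative; adapt them to your own task graph):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # Postgres in production; SQLite is enough to start
conn.executescript("""
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    goal TEXT NOT NULL,
    acceptance TEXT NOT NULL,        -- how "done" is proven
    status TEXT NOT NULL DEFAULT 'queued'
        CHECK (status IN ('queued', 'running', 'blocked', 'done'))
);
CREATE TABLE decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT REFERENCES tasks(id),
    rationale TEXT NOT NULL,         -- short, structured reasoning
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# The model reads from and writes back to this schema, which you can validate.
conn.execute("INSERT INTO tasks (id, goal, acceptance) VALUES (?, ?, ?)",
             ("t1", "add login tests", "pytest tests/test_login.py passes"))
conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", ("t1",))
```

The `CHECK` constraint is doing real work here: a model that hallucinates a status like "almost-done" gets rejected by the database, not silently accepted.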
3) Build a loop that cannot spin forever
A "Ralph Wiggum loop" is usually a missing circuit breaker. Add explicit budgets and stop conditions:
- Max retries per step, with exponential backoff
- Max token budget per task
- Max wall-clock time per task
- A rule like: after N failures, escalate to human review with a compact incident report
The incident report should include: last successful checkpoint, failing command, error snippet, and the agent's proposed next action.
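A circuit breaker like this is a few dozen lines. The sketch below combines a retry budget, a wall-clock budget, exponential backoff, and a compact incident report raised for human review; the names are illustrative.

```python
import time

class StepBudgetExceeded(Exception):
    """Raised so the orchestrator can escalate to a human with context."""

def run_with_budget(step, max_retries=3, max_seconds=60.0, base_delay=1.0):
    """Run one step with bounded retries, backoff, and a time budget."""
    start = time.monotonic()
    errors = []
    for attempt in range(max_retries):
        if time.monotonic() - start > max_seconds:
            break                                  # wall-clock budget exhausted
        try:
            return step()
        except Exception as exc:
            errors.append(str(exc))
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Compact incident report instead of an infinite "Ralph Wiggum loop"
    raise StepBudgetExceeded({
        "attempts": len(errors),
        "recent_errors": errors[-3:],
        "elapsed_s": round(time.monotonic() - start, 1),
    })
```

In a real system you would attach the last successful checkpoint and the agent's proposed next action to the report before paging a human.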
4) Separate planning from execution
Many loops fail because planning and execution happen in the same breath. A more stable pattern:
- Planner produces a short plan and writes it to state
- Executor runs one step only
- Critic verifies the result against acceptance criteria
- Orchestrator decides: continue, retry, re-plan, or escalate
You can implement this with open-source agent frameworks (for example LangGraph-style state machines), but the key is the separation of concerns.
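As a minimal sketch of that separation, here the planner, executor, and critic are stand-ins for model calls, while the control flow itself is deterministic code with an explicit re-plan budget (all names are mine, not from LangGraph or any specific framework):

```python
def run_cycle(planner, executor, critic, state, max_replans=2):
    """One plan -> act -> verify loop. The orchestrator, not the model,
    decides whether to continue, re-plan, or escalate."""
    for _ in range(max_replans + 1):
        plan = planner(state)              # planner writes a short plan to state
        state["plan"] = plan
        for step in plan:
            result = executor(step)        # executor runs one step only
            if critic(step, result):       # critic checks acceptance criteria
                state.setdefault("done", []).append(step)
            else:
                break                      # stop executing; re-plan on next pass
        else:
            return "complete"              # every step verified
    return "escalate"                      # re-plan budget exhausted
```

The value of the split is that the critic can say "this did not work" even when the executor's model is confident, and the orchestrator's decision logic is testable without any model in the loop.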
5) Treat tools like production integrations
Long projects touch real systems: git, CI, package registries, cloud APIs. Make tool calls observable and replayable.
Helpful open-source pieces:
- Langfuse for tracing and prompt/tool observability
- OpenTelemetry for unified logs, traces, and metrics
- A simple "tool gateway" service that logs inputs/outputs and enforces policies
Policies matter. Example: the agent can run tests and create branches, but merging to main requires approval.
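A tool gateway that enforces exactly that policy can be sketched in a few lines. Everything here (the policy table, the audit log, the function names) is illustrative; in production the gateway would be a service in front of your real integrations.

```python
import time

# Policy table: which tool calls are allowed, and under what conditions.
POLICIES = {
    "run_tests": lambda args: True,                      # always allowed
    "create_branch": lambda args: True,
    "merge": lambda args: args.get("approved") is True,  # humans gate main
}

AUDIT_LOG = []

def call_tool(name, args, tool_impls):
    """Every call is policy-checked and logged with inputs and outputs."""
    allowed = POLICIES.get(name, lambda a: False)(args)  # deny unknown tools
    entry = {"tool": name, "args": args, "allowed": allowed, "ts": time.time()}
    if allowed:
        entry["output"] = tool_impls[name](**args)
    AUDIT_LOG.append(entry)                              # replayable record
    if not allowed:
        raise PermissionError(f"policy denied tool call: {name}")
    return entry["output"]
```

Because every input and output lands in the audit log, you can replay a run step by step when something goes wrong, which is exactly the observability the previous section argued for.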
6) Use sandboxes and artifacts to prove progress
For coding agents, the best antidote to drift is an artifact pipeline:
- Every change is a commit
- Every claim is backed by a test result or a reproducible command
- The agent must keep a "runbook" file updated: how to build, test, and reproduce
Run the agent in a sandbox (Docker is usually sufficient) so you can restart without contaminating environments.
What "bigger/longer" looks like in practice
If I were answering Luke's question directly, I would define "longer" as:
- The system can run unattended overnight
- It can recover from transient failures
- It can resume after a restart
- It produces measurable progress checkpoints
- It stops safely when uncertain
That is less about making the model smarter, and more about making the surrounding system resilient.
Here is a simple blueprint for a long-running open-source agent setup:
- Orchestrator: Temporal (or Prefect) owns retries, schedules, and timeouts
- Agent logic: a state machine (graph) that runs plan -> act -> verify cycles
- State: Postgres for tasks, decisions, and checkpoints
- Workspace: Docker sandbox per job, with mounted repo
- Observability: Langfuse + OpenTelemetry logs
- Safety: budgets, circuit breakers, and human escalation
A checklist to avoid the "few hours" cliff
If your agents currently stall after a couple of hours, these changes usually move the needle quickly:
- Define "done" as tests passing, a PR created, or a report generated, not "looks good"
- Make every step idempotent (check before you change)
- Save checkpoints after each successful step
- Limit retries and force escalation after repeated failures
- Add a verifier that can say "this did not work" even when the model is confident
- Log every tool call and capture outputs
- Periodically re-anchor the agent by reloading the objective and acceptance criteria from state
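The "check before you change" item deserves a concrete example, because it is the cheapest fix on the list. A hedged sketch (the helper name is mine): an idempotent write that reports whether any work actually happened, so a retried step is harmless.

```python
import hashlib
import pathlib

def write_if_changed(path: str, content: str) -> bool:
    """Idempotent step: check before you change; return whether work happened."""
    p = pathlib.Path(path)
    digest = hashlib.sha256(content.encode()).hexdigest()
    if p.exists() and hashlib.sha256(p.read_bytes()).hexdigest() == digest:
        return False                 # already done; a retry is a no-op
    p.write_text(content)
    return True
```

The same pattern applies to commits, PRs, and API calls: query current state first, act only on the delta, and return a signal the orchestrator can checkpoint.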
If you can replay an agent run step-by-step, you can debug it. If you cannot replay it, you will eventually fear it.
Closing the loop with Luke's question
Luke asked: "Is anyone out running bigger/longer projects on open-source tooling?" I think the honest answer is: yes, but not by relying on a single agent loop. The teams getting endurance are borrowing patterns from distributed systems and workflow engineering: durable state, explicit orchestration, tight verification, and strong stop conditions.
If you are experimenting here, I would love to see more public write-ups that go beyond "it worked once" and instead report: runtime, failure modes, recovery strategy, and how progress was measured. That is the kind of detail that will turn agent hype into agent operations.
This blog post expands on a viral LinkedIn post by 🧑🚀 Luke Chadwick.