Leonidas Williamson

Why AI Agents Keep Failing in Production (And How I Fixed It)

I spent 6 months building AI agents. They kept dying. So I built the infrastructure to keep them alive.

Last year, I was excited about AI agents. I built a research agent that could search the web, summarize papers, and draft reports. It worked great in demos.
Then I tried to run it in production.
Within a week:

It crashed 47 times
A single runaway loop cost me $340 in API calls
A multi-step research task failed halfway through, leaving corrupted state
I had no idea what any of my agents were actually doing

Sound familiar?
I realized the problem wasn't the agents themselves. It was that we have no infrastructure for running agents reliably.
So I built one.

The Problem: Agents Are Fragile
Here's what nobody tells you about AI agents:

  1. They crash. A lot. Network timeouts. Rate limits. Malformed responses. Context window overflows. An agent that works 99% of the time will fail multiple times per day at scale.
  2. Multi-step tasks are disasters waiting to happen. Your agent is on step 7 of 10. It crashes. What now? Do you restart from the beginning? Do you have step 6's output saved? Can you even tell what step it was on?
  3. Costs are invisible until they're catastrophic. One bad prompt, one infinite loop, one overly curious agent — and you're staring at a $500 bill from a task that should have cost $0.50.
  4. You're flying blind. What's your agent doing right now? Which step is it on? How much has it spent? Is it stuck? Most agent frameworks give you zero visibility.

The Solution: Orchestration

I looked at how other industries solved similar problems:

Telecom had Erlang/OTP — supervisors that restart crashed processes automatically

Finance had the Saga pattern — multi-step transactions that roll back cleanly on failure

Infrastructure had Kubernetes — orchestration for containers with health checks and auto-healing

AI agents had... nothing.

So I built Nexus OS — an orchestration layer that brings these battle-tested patterns to AI agents.

What Nexus OS Does

1. Supervisors (Stolen from Erlang)

In Erlang, processes crash all the time. That's fine — supervisors restart them automatically. The system stays up even when individual pieces fail. Nexus brings this to agents:

```yaml
supervisor:
  name: research-team
  strategy: one-for-one  # Only restart the agent that crashed
  agents:
    - researcher
    - writer
    - reviewer
  maxRestarts: 5
  withinSeconds: 60
```

Three restart strategies:

one-for-one: Only restart the crashed agent
one-for-all: If one crashes, restart all (for tightly coupled agents)
rest-for-one: Restart the crashed agent and all agents started after it

Your agents will crash. Supervisors make that okay.
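To make the restart budget concrete, here's an illustrative sketch of one-for-one restart accounting with a sliding window (plain Python for readability; this is not Nexus's actual Rust internals, and all names are made up):

```python
import time

class Supervisor:
    """Restart a crashed agent, giving up once max_restarts happen
    within window_secs (mirrors maxRestarts / withinSeconds above)."""
    def __init__(self, max_restarts=5, window_secs=60):
        self.max_restarts = max_restarts
        self.window_secs = window_secs
        self.restart_times = []

    def should_restart(self, now=None):
        now = now if now is not None else time.monotonic()
        # Keep only restarts that fall inside the sliding window
        self.restart_times = [t for t in self.restart_times
                              if now - t < self.window_secs]
        if len(self.restart_times) >= self.max_restarts:
            return False  # Restart budget exhausted: escalate instead
        self.restart_times.append(now)
        return True

sup = Supervisor(max_restarts=2, window_secs=60)
print(sup.should_restart(now=0.0))   # True
print(sup.should_restart(now=1.0))   # True
print(sup.should_restart(now=2.0))   # False: budget exhausted
print(sup.should_restart(now=90.0))  # True: window slid past old restarts
```

The window matters: without it, an agent that crash-loops slowly would eventually exhaust its budget and never restart again.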

2. Sagas (Stolen from Distributed Systems)

A saga is a sequence of steps where each step has a compensation action. If step 5 fails, you run compensations for steps 4, 3, 2, 1 — in reverse order.

```yaml
saga:
  name: publish-article
  steps:
    - name: research
      action: research-agent
      compensation: delete-research-notes

    - name: draft
      action: writing-agent
      compensation: delete-draft

    - name: review
      action: review-agent
      compensation: revert-review

    - name: publish
      action: publish-agent
      compensation: unpublish
```

If publishing fails, the article gets unpublished, the review gets reverted, the draft gets deleted, and the research notes get cleaned up. Automatically.
No more corrupted state from half-finished tasks.
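The unwind logic itself is small. Here's an illustrative Python sketch of the pattern (not Nexus's actual Rust implementation; the dict shape is made up):

```python
def fail():
    raise RuntimeError("draft failed")

def run_saga(steps):
    """Run steps in order; on failure, run compensations in reverse order."""
    completed = []
    for step in steps:
        try:
            step["action"]()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done["compensation"]()  # unwind newest-first
            raise

log = []
steps = [
    {"action": lambda: log.append("research"),
     "compensation": lambda: log.append("delete-research-notes")},
    {"action": fail,  # the draft step blows up
     "compensation": lambda: log.append("delete-draft")},
]
try:
    run_saga(steps)
except RuntimeError:
    pass
print(log)  # ['research', 'delete-research-notes']
```

Note that only completed steps get compensated: the failed step's own compensation never runs, which is exactly the saga contract.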

3. Cost Controllers (Because $500 Surprises Suck)

Every agent gets a budget. When they hit it, you decide what happens:

```yaml
cost:
  agent: research-bot
  budget:
    maxTokens: 100000
    maxDollars: 5.00
  onLimit: pause  # or: throttle, alert, kill
```

Real-time tracking. Hard limits. No more surprise bills.
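Under the hood this is just a running tally checked against the limits. An illustrative sketch (field names follow the YAML above; this is not the actual Nexus implementation):

```python
class CostController:
    """Track spend per agent and return the configured action at the limit."""
    def __init__(self, max_tokens, max_dollars, on_limit="pause"):
        self.max_tokens = max_tokens
        self.max_dollars = max_dollars
        self.on_limit = on_limit
        self.tokens = 0
        self.dollars = 0.0

    def record(self, tokens, dollars):
        self.tokens += tokens
        self.dollars += dollars
        if self.tokens > self.max_tokens or self.dollars > self.max_dollars:
            return self.on_limit  # e.g. "pause": stop scheduling this agent
        return "ok"

ctl = CostController(max_tokens=100_000, max_dollars=5.00)
print(ctl.record(40_000, 2.00))  # ok
print(ctl.record(40_000, 2.00))  # ok
print(ctl.record(40_000, 2.00))  # pause: 120k tokens blows the budget
```

The important property is that the check runs on every recorded call, so a runaway loop gets stopped after one over-budget step instead of after a $340 bill.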

4. Pools (For Parallel Work)

Fan out work to multiple agents, merge the results:

```yaml
pool:
  name: research-pool
  agents:
    - researcher-1
    - researcher-2
    - researcher-3
  strategy: majority  # Return when 2/3 agree
```

Strategies:

all: Wait for everyone
first: Return the fastest response
majority: Wait for >50% agreement
quorum: Custom threshold
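The majority merge step fits in a few lines. An illustrative Python sketch (not Nexus internals; a real pool would stream results as agents finish rather than wait for all of them):

```python
from collections import Counter

def majority(results):
    """Return the answer more than half the pool agrees on, else None."""
    answer, count = Counter(results).most_common(1)[0]
    return answer if count > len(results) / 2 else None

print(majority(["A", "A", "B"]))  # A    (2 of 3 agree)
print(majority(["A", "B", "C"]))  # None (no majority: escalate or retry)
```

`quorum` is the same idea with the threshold made configurable instead of hard-coded at >50%.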

5. AXIS Trust (Identity for Agents)

This one's different. I built a separate system called AXIS Trust for agent identity and reputation. Every agent gets:

AUID: A unique identifier
Trust Score: 0-100 based on behavior
Credit Rating: AAA to D

Before an agent runs, Nexus can verify its trust level:

```yaml
trust:
  provider: axis
  requirements:
    minTrustTier: T3
    minCreditRating: BBB
  enforcement:
    onUntrusted: reject
```
As agents start interacting with each other (and with money), trust infrastructure becomes critical.
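To make the gate concrete, here's an illustrative sketch of rating-floor enforcement. The exact rating ladder and function names here are assumptions for illustration, not AXIS's actual API:

```python
# Worst-to-best ladder; the tiers between D and AAA are assumed for this sketch.
RATINGS = ["D", "C", "B", "BB", "BBB", "A", "AA", "AAA"]

def admit(agent_rating, min_rating="BBB"):
    """Reject any agent whose credit rating falls below the configured floor."""
    return RATINGS.index(agent_rating) >= RATINGS.index(min_rating)

print(admit("AA"))  # True:  above the BBB floor
print(admit("B"))   # False: below the floor, onUntrusted: reject fires
```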

The Technical Decisions
Why Rust?

Single binary: No runtime, no dependencies. Download and run.
Performance: Orchestration needs to be fast and lightweight.
WASM support: Agents run in sandboxed WASM containers via wasmtime.
Memory safety: Long-running processes can't afford memory leaks.

The entire binary is ~10MB.
Why WASM Sandboxing?
Agents run arbitrary code. That's terrifying.
WASM gives us:

Memory isolation
CPU time limits
No filesystem access (unless explicitly granted)
No network access (unless explicitly granted)

An agent can't rm -rf /. It can't exfiltrate data. It can only do what you allow.

Why YAML Config?

Controversial take: YAML is fine.
For infrastructure configuration, YAML is readable, diffable, and familiar. Your orchestration config should live in version control alongside your code.

Getting Started

Install:

```shell
cargo install --git https://github.com/leonidas-esquire/nexus-os.git
```

Don't have Rust? Install it first:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Create a project:

```shell
naos init my-project
cd my-project
```

Create an agent:

```shell
naos create researcher --template research
```

Run it:

```shell
naos run researcher
```

See what's happening:

```shell
naos dashboard
```

This opens a web UI at localhost:4200 showing all your agents, their status, costs, and logs.

What I Learned Building This

  1. Production is a different planet
    The gap between "works in a notebook" and "runs reliably in production" is massive. Most agent frameworks are optimized for the notebook. Nexus is optimized for production.

  2. Erlang got it right 40 years ago
    The "let it crash" philosophy with supervisor trees is brilliant. Instead of trying to handle every possible error, you accept that crashes happen and build systems that recover automatically.

  3. Visibility is a feature
    Half of "reliability" is just knowing what's happening. A dashboard that shows agent status, costs, and logs in real-time is worth more than clever error handling.

  4. Cost controls aren't optional
    AI agents with access to paid APIs are like employees with company credit cards. You need limits, tracking, and alerts. This should be built into the infrastructure, not bolted on.

What's Next

Nexus OS is open source (Apache 2.0). The roadmap:
Now:

Core orchestration (supervisors, sagas, workflows, pools)
Cost controls
AXIS Trust integration
Web dashboard

Coming soon:

WASM skill marketplace (reusable agent capabilities, devs earn money)
TypeScript SDK
Multi-node clustering

Later:

Managed cloud offering
Enterprise features (SSO, RBAC, audit logs)

Try It

GitHub: github.com/leonidas-esquire/nexus-os
Docs: aiagents.nexus/docs
Website: aiagents.nexus

I'd love feedback — especially on the API design and what orchestration patterns you'd want to see.

If you've struggled with keeping AI agents running in production, give Nexus a try.

And if you have war stories about agent failures, I'd love to hear them in the comments.

Building something with AI agents? I write about agent infrastructure, reliability patterns, and lessons learned.

Top comments (4)

Archit Mittal

The production failure patterns you describe are painfully accurate. In my experience building automation workflows with AI agents, the #1 killer is error cascading - when one tool call fails and the agent tries to 'recover' by making increasingly wrong decisions instead of gracefully degrading. The fix that worked best for me was implementing explicit checkpoint/rollback semantics - every agent action gets a snapshot, and on failure you roll back to the last known good state rather than letting the LLM improvise a recovery. Also, structured output validation between every step catches hallucinated parameters before they hit your APIs.

Leonidas Williamson

Thanks Archit — error cascading is exactly the nightmare scenario that pushed me to build this.

You nailed it: letting the LLM improvise a recovery is asking for trouble. They'll confidently make things worse.

The checkpoint/rollback pattern you describe is essentially what Sagas do in Nexus — every step gets a compensation action, and on failure you unwind cleanly instead of hoping the agent figures it out.

The structured output validation point is interesting. Right now Nexus validates at the orchestration layer (did the step succeed/fail), but validating the content of outputs between steps could catch hallucinated parameters before they propagate.

Would you want that as a built-in primitive, or more of a "validation agent" you wire into your workflow?

Curious what automation workflows you've been building.

Hollow House Institute • Edited

This is a solid implementation of known reliability patterns.

The gap I still see is not orchestration. It is enforcement at execution time.

Supervisors restart.
Sagas roll back.
Pools coordinate.
Budgets limit spend.

All of that improves recovery and visibility.

It does not guarantee that the system is making valid decisions before execution.

Most failures I see are not from crashes. They come from:

  • acting on stale or superseded context
  • proceeding with low confidence outputs
  • crossing implicit boundaries during multi step tasks
  • propagating incorrect intermediate state

Those are not infra failures. They are control failures.

An agent can complete every step of a saga and still produce an invalid outcome if no boundary is enforced on what is allowed to execute.

What tends to be missing is:

  • decision boundaries defined before the agent runs
  • validation gates between steps, not just success or failure states
  • escalation when outputs fall outside expected ranges
  • stop conditions when state integrity is unclear

Orchestration answers how agents run.

Governance answers whether they should proceed at all.

Without that layer, systems become better at continuing, not better at being correct.

Leonidas Williamson

You're identifying the exact gap I've been working on.

Orchestration handles recovery. It doesn't guarantee correctness. An agent can complete every step of a saga successfully and still produce garbage if no one validated the intermediate outputs.

The failure modes you listed are the ones I see constantly:

Stale context that nobody invalidated
Low confidence outputs treated as facts
Implicit boundaries that were never made explicit
Intermediate state that propagated unchecked

This is why I'm building a Validator primitive that sits between workflow steps. The goal is enforcement that's fast, deterministic, and doesn't rely on the LLM to judge its own work:

Schema validation between steps (JSON Schema, type checks)
Inline rules for range checks, format validation, business constraints
Retry with feedback when validation fails, passing errors back so the agent can self correct
Escalation or hard stop when outputs fall outside expected bounds

The key distinction: structural validation (did the output match the schema) is cheap and deterministic. Semantic validation (is this output correct) is harder. But catching structural failures early prevents most downstream damage.
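A minimal sketch of what that between-steps loop could look like (illustrative Python only; the function names and retry shape are assumptions, not the actual Nexus Validator API):

```python
def run_step_validated(step, validate, max_retries=2):
    """Run a step, check its output structurally, retry with feedback."""
    feedback = None
    errors = []
    for _ in range(max_retries + 1):
        output = step(feedback)    # agent step; feedback guides the retry
        errors = validate(output)  # structural checks: cheap, deterministic
        if not errors:
            return output
        feedback = errors          # pass errors back so the agent self-corrects
    raise RuntimeError(f"validation failed after retries: {errors}")

def validate_report(output):
    errors = []
    if not isinstance(output.get("score"), int):
        errors.append("score must be an int")
    elif not 0 <= output["score"] <= 100:
        errors.append("score out of range 0-100")
    return errors

# Fake agent step that "self-corrects" once it sees feedback
def step(feedback):
    return {"score": 75} if feedback else {"score": "high"}

print(run_step_validated(step, validate_report))  # {'score': 75}
```

Nothing in that loop asks the LLM to judge its own work; the gate is deterministic, and the model only gets involved when it's handed concrete errors to fix.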
Your framing is useful:

Orchestration answers "how do agents run."
Validation answers "should this output proceed."
Governance is probably both, plus policy about what's allowed to execute at all.

Curious if you've seen systems that handle the governance layer well. Most of what I've encountered either skips it entirely or bolts it on after something breaks.