Why LLM Agents Fail in Production (And What to Do About It)

LLM agents are having a moment. Every week there's a new framework, a new demo, a new claim that AGI-adjacent productivity is just one API call away. But if you've tried to ship a real agent to production, you know the gap between "impressive demo" and "reliable product" is enormous.

After spending the last year building agents that actually had to work—not just wow people in presentations—I've catalogued the most common failure modes and what I've learned to do about them.

The Core Problem: Stochastic Systems in Deterministic Contexts

The fundamental tension is this: LLMs are probabilistic. Products need to be reliable. These two things don't naturally coexist.

When you chain several LLM calls together, each one has some probability of doing something unexpected. Even if each step is 95% reliable, five chained steps give you roughly 77% reliability. Ten steps gets you to 60%. That's not a product—that's a demo.

Failure Mode 1: Context Window Drift

Agents accumulate context over multiple steps. The problem is that early instructions get "forgotten" as more content fills the window. I've seen agents that start with clear goals become increasingly confused by step 8 because the original task is buried under thousands of tokens of tool outputs.

Fix: Restate the original goal explicitly in your system prompt at every step. Don't rely on the model to remember what it's doing from the initial message.

Failure Mode 2: Tool Call Hallucination

Models will call tools with invented parameters when they're uncertain. They'll pass a user ID that doesn't exist, call an endpoint with a field name that's slightly wrong, or retry a failed operation with the exact same (broken) inputs.

Fix: Use strict JSON schema validation on every tool call. Return structured error messages that tell the model specifically what was wrong and what it should try instead. Never just pass an error string—you'll get an apology, not a correction.

Failure Mode 3: Infinite Loops

Without explicit loop detection, agents will get stuck retrying the same approach indefinitely. I've had agents make 40 identical tool calls because each failure response was ambiguous enough that the model thought it was making progress.

Fix: Track state explicitly. Keep a list of what has been tried. Inject that list into the context on each step so the model can see it isn't making progress.

Failure Mode 4: Scope Creep

Given enough steps and broad enough tools, agents will start doing things you didn't ask them to do. An agent asked to "summarize my emails" will start composing replies. An agent asked to "check the database" will start making modifications.

Fix: Explicit, narrow tool definitions. Separate read and write tools. Add confirmation gates before any destructive or irreversible action. Treat every write operation as requiring explicit user intent.

What Actually Helps

The agents that work in production share a few properties:

Short chains. More than 5 steps is a risk. More than 10 is almost always a mistake.
Structured outputs throughout. Not just at the end—every intermediate step should produce parseable output.
Human-in-the-loop at branch points. Let the model do the reasoning; let the human confirm the direction at key decisions.
Evals, evals, evals. You cannot improve what you can't measure. Build your eval suite before you build the agent.

Agents are genuinely powerful—but they require a different engineering discipline than traditional software. The cost of a bug isn't a compiler error; it's a plausible-sounding wrong answer that you might not catch for hours.

Build accordingly.