
The AI agent hype cycle is at its peak. Every startup is building agents. Every enterprise is buying agents. And most of them do not work reliably enough for production use. We know because we have built 8 production AI agents in the past 18 months, and 3 of them failed before we figured out what makes agents actually work in the real world.
An AI agent, for the purposes of this article, is a system where an LLM makes decisions about what actions to take, executes those actions through tool calls, observes the results, and decides what to do next. This is fundamentally different from a chatbot, which generates text, or a pipeline, which follows a fixed sequence of steps. The key characteristic of an agent is autonomous decision-making in a loop.
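In code terms, that loop is surprisingly small. Here is a minimal sketch of the decide-act-observe cycle; the `decide` callable stands in for an LLM call and the tool functions are placeholders, not any specific framework:

```python
from typing import Callable

# Minimal sketch of the agent loop described above. decide() stands in for
# an LLM call that returns either a tool invocation or a final answer;
# nothing here is tied to a specific framework.

def run_agent(
    task: str,
    tools: dict[str, Callable[[str], str]],
    decide: Callable[[list[dict]], dict],
    max_steps: int = 10,
) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = decide(history)                                    # LLM chooses the next action
        if step["action"] == "finish":
            return step["answer"]                                 # the agent decides it is done
        observation = tools[step["action"]](step["input"])        # execute the chosen tool
        history.append({"role": "tool", "content": observation})  # observe the result
    return "Stopped: step limit reached without a final answer."
```

Everything in this article is about what you wrap around that loop, and where you deliberately break it.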
Here is what we learned across all 8 projects, organized into the patterns that work and the patterns that fail.
The first pattern that works: narrow scope with hard guardrails. Every successful agent we built does one thing well rather than many things poorly. Our most reliable agent is a customer support triage system for a SaaS company. It reads incoming support tickets, classifies them by urgency and category, routes them to the correct team, and drafts an initial response for a human to review. It does not resolve tickets autonomously. It does not access customer accounts. It does not make promises. Its scope is deliberately narrow.
This agent processes about 400 tickets per day with 94% routing accuracy and 87% response quality (as rated by the support team). It saves the client approximately 60 hours of support agent time per week. It works because the decision space is small, the actions are low-risk, and there is always a human in the loop before anything reaches the customer.
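One concrete way to enforce that kind of hard guardrail is an explicit allow-list of actions checked in code, not left to the prompt. A sketch of the idea; the action names are illustrative, not the client's actual system:

```python
# Sketch of a hard guardrail: the agent can only request actions on an
# explicit allow-list, and anything customer-facing lands in a human
# review queue. Action names are illustrative assumptions.

ALLOWED_ACTIONS = {"classify_ticket", "route_ticket", "draft_response"}

def execute_action(action: str, payload: dict, review_queue: list) -> dict:
    if action not in ALLOWED_ACTIONS:
        # The LLM asked for something outside the agent's scope: refuse outright.
        raise PermissionError(f"Action '{action}' is not permitted for this agent")
    if action == "draft_response":
        # Drafts never go straight to the customer; a human reviews them first.
        review_queue.append(payload)
        return {"status": "queued_for_human_review"}
    return {"status": "ok", "action": action, "payload": payload}
```

The point is that the scope limit lives in code the model cannot talk its way around.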
Compare that to our first failed agent: a "general purpose business assistant" that could search the company knowledge base, query databases, draft emails, schedule meetings, and update CRM records. The demo was spectacular. In production, it confidently scheduled meetings for the wrong time zone, drafted emails with fabricated details from the knowledge base, and updated CRM records based on misinterpreted instructions. The failure rate was around 15%, which sounds low until you realize it means roughly 1 in 7 actions was wrong, and some of those wrong actions sent incorrect emails to clients.
The lesson: agents should be specialists, not generalists. Build multiple narrow agents rather than one broad agent. Each agent should have a clearly defined set of actions, a limited decision space, and appropriate human oversight for its risk level.
The second pattern that works: deterministic scaffolding with LLM decision points. The best agents are not pure LLM loops. They are deterministic workflows with specific points where an LLM makes a decision. Think of it as a flowchart where some diamonds are code (if/else) and some diamonds are LLM calls (classify/decide).
Our most complex successful agent is a document processing system that handles insurance claims. The overall workflow is deterministic: receive document, extract text, classify document type, extract relevant fields, validate against business rules, route for human review or auto-approve. But within that deterministic workflow, steps 3 (classify), 4 (extract), and 6 (route) use LLM calls because these tasks require understanding natural language context.
This hybrid approach gives you the reliability of a traditional workflow with the flexibility of LLM reasoning at specific points. When the LLM makes a mistake at step 3, the deterministic validation at step 5 catches it. When the LLM is unsure at step 6, it defaults to human review rather than auto-approving.
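Structurally, the pattern looks something like the sketch below: the control flow is ordinary code, and the LLM appears only at specific decision points, passed in here as callables. The function names and return shapes are illustrative placeholders, not the actual production system.

```python
from typing import Callable

# Sketch of deterministic scaffolding with LLM decision points, loosely
# modeled on the claims workflow above. The three LLM calls are injected
# as callables; everything else is plain code.

def process_claim(
    text: str,
    classify: Callable[[str], str],         # step 3: LLM decision point
    extract: Callable[[str, str], dict],    # step 4: LLM decision point
    route: Callable[[dict], str],           # step 6: LLM decision point
    validate: Callable[[dict], list[str]],  # step 5: deterministic business rules
) -> dict:
    doc_type = classify(text)
    fields = extract(text, doc_type)

    errors = validate(fields)               # deterministic check catches LLM mistakes upstream
    if errors:
        return {"status": "human_review", "reasons": errors, "fields": fields}

    decision = route(fields)
    if decision == "auto_approve":
        return {"status": "auto_approved", "fields": fields}
    # Anything other than a confident auto-approve goes to a person.
    return {"status": "human_review", "reasons": ["router declined"], "fields": fields}
```

Because the sequence of steps is fixed, every run of the workflow is reproducible; only the content of the LLM decisions varies.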
The failed pattern is the unconstrained ReAct loop, where the LLM repeatedly decides what action to take next based on the full conversation history. These loops are non-deterministic by nature. The same input can produce different action sequences on different runs. Debugging is nightmarish because you cannot reproduce failures reliably. And the error propagation is catastrophic because a wrong decision at step 3 of a 10-step loop can cascade into 7 subsequent wrong decisions before anything checks the work.
The third pattern that works: aggressive output validation. Every tool call result should be validated before the agent acts on it. Every LLM output should be parsed and checked against expected schemas. Every external API response should be verified for sanity.
In our insurance claims agent, the field extraction step outputs structured JSON. Before that JSON is passed to the next step, it goes through a validation layer that checks for required fields, validates that dollar amounts are within reasonable ranges, verifies that dates are logically consistent (claim date is after policy start date), and flags any fields where the LLM expressed low confidence. About 8% of extractions are caught by this validation and routed to human review.
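A minimal version of such a validation layer might look like the following. The field names, dollar range, and confidence flag are illustrative assumptions, not the actual claims schema:

```python
from datetime import date

# Minimal sketch of the validation layer described above. Field names,
# the dollar range, and the confidence flag are illustrative assumptions.

REQUIRED_FIELDS = ("policy_number", "claim_amount", "claim_date", "policy_start_date")

def validate_extraction(fields: dict) -> list[str]:
    errors = []
    for name in REQUIRED_FIELDS:
        if name not in fields or fields[name] in (None, ""):
            errors.append(f"missing required field: {name}")
    if errors:
        return errors  # no point range-checking incomplete data

    amount = fields["claim_amount"]
    if not isinstance(amount, (int, float)) or not (0 < amount < 1_000_000):
        errors.append(f"claim_amount out of expected range: {amount!r}")

    try:
        claim_date = date.fromisoformat(fields["claim_date"])
        policy_start = date.fromisoformat(fields["policy_start_date"])
        if claim_date < policy_start:
            errors.append("claim_date precedes policy_start_date")
    except ValueError:
        errors.append("dates are not valid ISO dates")

    # Route anything the model itself marked as uncertain to a human.
    if fields.get("low_confidence_fields"):
        errors.append(f"low confidence on: {fields['low_confidence_fields']}")

    return errors  # empty list means the extraction passed validation
```

An empty list lets the claim continue through the workflow; anything else routes it to human review.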
Without this validation, those 8% of cases would have been processed with incorrect data, resulting in wrong claim amounts, misidentified policy holders, or invalid dates. In insurance, those errors have real financial consequences.
The fourth pattern that works: comprehensive logging and observability. Every decision the agent makes, every tool call, every LLM prompt and response, and every validation result should be logged with full context. When an agent makes a wrong decision in production, you need to be able to reconstruct exactly what happened, what the agent saw, what it decided, and why.
We use structured logging with correlation IDs that trace a single agent run from start to finish. Each log entry includes the agent's state, the LLM prompt, the response, the parsed output, and the validation result. This means any production issue can be debugged by pulling up the logs for that run and stepping through the agent's decision process.
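A sketch of what one such log entry can look like, using Python's standard logging module with a correlation ID attached to every record of a run; the specific field names here are illustrative, not a particular logging product:

```python
import json
import logging
import uuid

# Sketch of structured logging with a correlation ID that ties every
# record of one agent run together. Field names are illustrative.

logger = logging.getLogger("agent")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_step(run_id: str, step: str, prompt: str, response: str,
             parsed: dict, validation_errors: list[str]) -> None:
    logger.info(json.dumps({
        "run_id": run_id,            # correlation ID shared by the whole run
        "step": step,
        "prompt": prompt,
        "response": response,
        "parsed_output": parsed,
        "validation_errors": validation_errors,
    }))

# One run_id per agent invocation, reused for every step's log entry.
run_id = str(uuid.uuid4())
log_step(run_id, "classify", "Classify this ticket: ...", "billing",
         {"category": "billing"}, [])
```

Filtering logs by a single run_id reconstructs the full decision trail for that run.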
This logging infrastructure costs about $200 to $500 per month for our typical agent workloads, but it has saved us thousands in debugging time and client trust.
The fifth pattern that works: graceful degradation with human fallback. Every agent should have a clearly defined "I do not know" pathway. When the LLM is uncertain, when validation fails, when an API returns an unexpected result, the agent should hand off to a human rather than guessing.
Our support triage agent uses a confidence threshold. If the classification confidence is below 0.7, the ticket is routed to a human supervisor with the agent's best guess and reasoning attached. This means the supervisor gets a head start even when the agent cannot handle the ticket independently. The agent is still useful even when it fails, which is critical for user adoption. If agents only work when they are right and create problems when they are wrong, users will stop trusting them entirely.
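In code, the fallback is little more than a threshold check, but it has to exist as an explicit branch. A sketch, with the 0.7 threshold from the triage agent and illustrative field names:

```python
# Sketch of the graceful-degradation branch: below the confidence threshold,
# the agent hands off its best guess and reasoning instead of acting.
# The dictionary structure is an illustrative assumption.

CONFIDENCE_THRESHOLD = 0.7

def route_ticket(classification: dict) -> dict:
    if classification["confidence"] < CONFIDENCE_THRESHOLD:
        return {
            "action": "escalate_to_human",
            "best_guess": classification["category"],  # supervisor gets a head start
            "reasoning": classification["reasoning"],
        }
    return {"action": "auto_route", "team": classification["category"]}
```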
The patterns that fail, summarized: unlimited scope, unconstrained loops, missing validation, poor observability, and binary success/failure without graceful degradation. If your agent demo uses a ReAct loop with 20 tools and no validation layer, it will fail in production. Guaranteed.
Here are the numbers from our 8 agents. The 5 successful agents have reliability rates between 89% and 97% depending on the complexity of the task. They all use deterministic scaffolding with LLM decision points. They all have validation layers. They all have human fallback paths. They handle a combined total of approximately 3,000 decisions per day.
The 3 failed agents were all broad-scope, unconstrained-loop designs. Two were scrapped entirely and replaced with narrow specialist agents. One was restructured into a deterministic workflow with LLM decision points, which took 6 weeks but resulted in a 92% reliability rate, up from approximately 70%.
Building agents that actually work is not about using the most powerful model or the most sophisticated prompting technique. It is about boring, disciplined engineering: narrow scope, deterministic scaffolding, aggressive validation, comprehensive logging, and graceful degradation. The same principles that make traditional software reliable make AI agents reliable. The LLM is just another component that can fail, and it should be treated as such.
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
We love talking shop. If this article resonated, let's connect.