AI Engineering · 2025-11-25 · 7 min read

Why Your AI Chatbot Sucks: Common RAG Mistakes We See Every Week

Tags: RAG, chatbots, AI mistakes, debugging

Every week, someone asks us to look at their AI chatbot that is "not working well." The symptoms are always similar: the bot gives wrong answers, makes things up, contradicts itself, or responds with "I do not have information about that" when the information is clearly in the knowledge base. After auditing approximately 20 RAG-based chatbot deployments over the past year, we have identified seven mistakes that appear in nearly all of them.

Mistake number one: chunks are too small or too large. This is the most common issue and the easiest to fix. Most tutorials recommend chunk sizes of 256 to 512 tokens. Depending on the document type, a fixed size in that range is either too small to preserve context or too large to allow precise retrieval.

When chunks are too small, retrieved passages lack the context needed to answer questions completely. A question like "what is the refund policy for enterprise customers" might retrieve a chunk that says "enterprise customers are eligible for refunds" without the adjacent text that specifies the conditions and timeframe. The LLM then either hallucinates the details or gives a vague answer.

When chunks are too large, the retrieval step becomes imprecise. A 2,000-token chunk might contain information about three different topics. If the question is about one of those topics, the LLM has to sort through irrelevant information in the chunk, which increases the chance of the model anchoring on the wrong section.

The fix is to use semantic chunking that respects document structure. Split on section boundaries, not arbitrary token counts. If a section is longer than 800 tokens, split it at paragraph boundaries. If a section is shorter than 200 tokens, merge it with the adjacent section. And always include a "context header" with each chunk that identifies the document title, section, and any parent headings. This simple metadata dramatically improves the LLM's ability to interpret the chunk correctly.
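As a rough sketch of what structure-aware chunking can look like, the function below splits parsed sections at paragraph boundaries when they run long, merges short neighbors, and prepends a context header. The 200 and 800 token thresholds, the word-count token estimate, and the section format are assumptions for illustration, not a fixed recipe.

```python
# A rough sketch of structure-aware chunking with context headers.
# Assumes documents are already parsed into sections with heading paths;
# the 200/800-token thresholds and the word-count token estimate are illustrative.

def estimate_tokens(text: str) -> int:
    # Rough proxy: English text averages roughly 0.75 words per token.
    return int(len(text.split()) / 0.75)

def build_header(doc_title: str, heading_path: list[str]) -> str:
    # Context header so the LLM knows where the chunk came from.
    return f"[{doc_title} > {' > '.join(heading_path)}]"

def chunk_document(doc_title: str, sections: list[dict],
                   min_tokens: int = 200, max_tokens: int = 800) -> list[str]:
    chunks, carry = [], ""  # carry holds a small section awaiting a merge
    for section in sections:
        text = f"{carry}\n\n{section['text']}".strip() if carry else section["text"]
        carry = ""
        header = build_header(doc_title, section["heading_path"])
        if estimate_tokens(text) < min_tokens:
            carry = text  # too small: merge into the next section
            continue
        if estimate_tokens(text) <= max_tokens:
            chunks.append(f"{header}\n{text}")
            continue
        # Too large: split at paragraph boundaries, keeping each piece under the cap.
        piece = ""
        for para in text.split("\n\n"):
            if piece and estimate_tokens(f"{piece}\n\n{para}") > max_tokens:
                chunks.append(f"{header}\n{piece}")
                piece = ""
            piece = f"{piece}\n\n{para}" if piece else para
        if piece:
            chunks.append(f"{header}\n{piece}")
    if carry:  # a trailing small section with nothing left to merge into
        chunks.append(f"{build_header(doc_title, sections[-1]['heading_path'])}\n{carry}")
    return chunks
```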

Mistake number two: no query transformation. Users do not ask questions the way documents are written. A user might ask "how do I reset my password" while the documentation says "to change your account credentials, navigate to the security settings panel." A direct embedding similarity search between the query and the chunks may not surface the relevant document because the vocabulary is different.

The fix is query transformation: rewrite or expand the query before retrieval, or generate a hypothetical answer and embed that instead (the HyDE technique, Hypothetical Document Embeddings). Before searching the vector database, use a fast LLM call to rewrite the user query into a form that is more likely to match the document style. For the password example, the transformation might produce "change account credentials password security settings." This step costs on the order of 50 cents per thousand queries and improved retrieval accuracy by 15 to 30 percent in our testing.
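Here is a minimal sketch of that rewrite step, using the OpenAI Python SDK as one possible implementation. The model name, prompt wording, and documentation-style target are assumptions you would tune for your own corpus.

```python
# A minimal sketch of query transformation before retrieval.
# The model name and prompt are assumptions; any fast, inexpensive model works here.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's question as a short passage written in the style of "
    "product documentation, using the vocabulary a technical writer would use. "
    "Return only the rewritten text."
)

def transform_query(user_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in whatever fast model you use
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": user_query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# "how do I reset my password" might come back as something like
# "To change your account credentials, open the security settings panel..."
# Embed and search with the transformed query instead of the raw one.
```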

Mistake number three: no reranking. Vector similarity search returns the chunks with embeddings most similar to the query embedding. But embedding similarity is not the same as relevance. A chunk that uses similar vocabulary but answers a different question will score highly. A chunk that answers the exact question using different vocabulary might score lower.

The fix is a reranking step. Retrieve 20 to 50 candidate chunks from the vector database, then use a cross-encoder reranking model like Cohere's rerank or a fine-tuned model to re-score them based on actual relevance to the query. The top 5 to 10 reranked chunks are then passed to the LLM. This consistently improves answer quality and costs very little in terms of latency and compute.
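As an illustration, here is what retrieve-then-rerank can look like with Cohere's rerank endpoint. The model name, top_n value, and client setup are assumptions; check the current Cohere SDK documentation for the exact signatures in your version.

```python
# A minimal sketch of retrieve-then-rerank using Cohere's rerank endpoint.
# Model name, top_n, and client setup are assumptions.
import cohere

co = cohere.Client()  # assumption: API key provided via environment configuration

def rerank_chunks(query: str, candidate_chunks: list[str], top_n: int = 8) -> list[str]:
    # candidate_chunks: the 20-50 chunks returned by the vector search
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidate_chunks,
        top_n=top_n,
    )
    # Keep only the top-scoring chunks, in relevance order, for the LLM prompt.
    return [candidate_chunks[r.index] for r in response.results]
```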

Mistake number four: stuffing all retrieved chunks into a single prompt. Most RAG implementations retrieve K chunks and concatenate them into the system prompt or context section. This creates two problems. First, the LLM does not know which chunk is most relevant and may anchor on the first chunk regardless of its relevance due to position bias. Second, if chunks contain contradictory information because the knowledge base has outdated documents, the LLM has no way to resolve the contradiction.

The fix is to present chunks with metadata and explicit source attribution. Instead of concatenating raw text, format each chunk with its source document, section title, date, and a relevance score. Instruct the LLM to prefer more recent sources when information conflicts and to cite specific sources in its response. This structured approach reduces hallucination and makes answers verifiable.
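One way to sketch this is to format each chunk with explicit source fields and pair it with an instruction block. The field names and instruction wording below are illustrative, not a fixed spec.

```python
# A minimal sketch of structured chunk presentation with source attribution.
# Field names and instruction wording are assumptions.
def format_context(chunks: list[dict]) -> str:
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(
            f"[Source {i}] {chunk['doc_title']} - {chunk['section']}\n"
            f"Last updated: {chunk['updated_at']} | Relevance: {chunk['score']:.2f}\n"
            f"{chunk['text']}"
        )
    return "\n\n".join(blocks)

SYSTEM_INSTRUCTIONS = (
    "Answer using only the sources below. When sources conflict, prefer the "
    "most recently updated one. Cite sources as [Source N] after each claim. "
    "If the sources do not contain the answer, say so."
)
```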

Mistake number five: no answer grounding verification. This is the mistake that causes the most user frustration. The LLM generates an answer that sounds authoritative but includes details not present in any retrieved chunk. This happens because LLMs are trained to be helpful, and when the retrieved context does not fully answer the question, the model fills in gaps from its training data.

The fix is a grounding verification step after generation. Use a separate, fast LLM call to check whether every claim in the generated answer is supported by the retrieved chunks. If a claim is not grounded, either remove it and note the gap, or flag it as uncertain. This adds 100 to 200ms of latency and a small cost, but it is the single most effective technique for reducing hallucination in RAG systems.
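Below is a minimal sketch of a grounding check using a second, fast LLM call as a judge. The JSON contract, prompt, and model name are assumptions; the point is the shape of the step, not the exact wording.

```python
# A minimal sketch of post-generation grounding verification with an LLM judge.
# The JSON contract, prompt, and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = (
    "You are checking a chatbot answer against its source passages. "
    "For each factual claim in the answer, decide whether it is supported by "
    "the passages. Respond with JSON: "
    '{"unsupported_claims": ["..."], "grounded": true or false}'
)

def verify_grounding(answer: str, chunks: list[str]) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any fast model can act as the judge
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": VERIFY_PROMPT},
            {"role": "user", "content": "PASSAGES:\n" + "\n---\n".join(chunks)
                                        + "\n\nANSWER:\n" + answer},
        ],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# If the verdict says grounded is false, either strip the unsupported claims
# or return the answer with an explicit uncertainty note.
```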

We implemented grounding verification for a financial services client whose chatbot was confidently providing incorrect regulatory information. Before verification, 18% of responses contained ungrounded claims. After verification, that number dropped to 3%, and the remaining 3% were caught by the verification step and flagged as uncertain rather than presented as fact.

Mistake number six: ignoring conversation history. Most RAG chatbot implementations treat each message independently, embedding and searching only the latest user message. But conversations have context. If a user asks "what are the pricing tiers" and then follows up with "what about the enterprise one," the second query alone does not contain enough information for effective retrieval.

The fix is conversation-aware query construction. Before embedding and searching, use the conversation history to construct a self-contained query. The follow-up "what about the enterprise one" becomes "what are the details of the enterprise pricing tier." This can be done with a simple LLM call that rewrites the latest message in the context of the conversation, or with a sliding window approach that includes the last 2 to 3 exchanges in the query.
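Here is a rough sketch of the rewrite-then-retrieve approach; the prompt, the two-exchange window, and the model name are assumptions.

```python
# A rough sketch of conversation-aware query rewriting before retrieval.
# Prompt wording, window size, and model name are assumptions.
from openai import OpenAI

client = OpenAI()

def build_standalone_query(history: list[dict], latest_message: str) -> str:
    # history: prior turns as {"role": "user" or "assistant", "content": ...}
    recent = history[-4:]  # roughly the last two exchanges
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in recent)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[
            {"role": "system", "content":
                "Rewrite the user's latest message as a fully self-contained "
                "question, resolving pronouns and references using the "
                "conversation. Return only the rewritten question."},
            {"role": "user", "content": f"{transcript}\nuser: {latest_message}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# "what about the enterprise one" -> "What are the details of the enterprise
# pricing tier?" -- embed and search with this query instead of the raw message.
```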

Mistake number seven: no feedback loop. This is the strategic mistake that prevents chatbots from improving over time. Most teams deploy their RAG chatbot and never systematically measure how well it performs. They rely on anecdotal user complaints to identify issues, which means they only hear about the worst failures and miss the slow degradation in quality.

The fix is a multi-layered feedback system. At a minimum, implement thumbs up and thumbs down buttons on every response. Track which queries get negative feedback, which queries have no retrieved chunks above the relevance threshold, and which queries trigger the "I do not have information" fallback. Review these logs weekly. The patterns will show you exactly where your knowledge base has gaps, where your chunking strategy is failing, and where your retrieval is returning irrelevant results.
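To make the logging side concrete, here is a small sketch using SQLite so it stays self-contained; the table layout, relevance threshold, and review query are illustrative.

```python
# A small sketch of the logging side of a feedback loop, using SQLite so the
# example is self-contained. Table and column names are illustrative.
import sqlite3
import time

conn = sqlite3.connect("rag_feedback.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS query_log (
        ts REAL, query TEXT, top_score REAL,
        fallback_used INTEGER, feedback TEXT
    )
""")

def log_query(query: str, top_score: float, fallback_used: bool,
              feedback: str | None = None) -> None:
    conn.execute(
        "INSERT INTO query_log VALUES (?, ?, ?, ?, ?)",
        (time.time(), query, top_score, int(fallback_used), feedback),
    )
    conn.commit()

def weekly_review(min_score: float = 0.5) -> list[tuple]:
    # Queries worth reviewing: negative feedback, weak retrieval, or fallback answers.
    return conn.execute(
        "SELECT query, top_score, fallback_used, feedback FROM query_log "
        "WHERE feedback = 'down' OR top_score < ? OR fallback_used = 1 "
        "ORDER BY ts DESC",
        (min_score,),
    ).fetchall()
```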

One of our clients went from a 62% user satisfaction rate to 89% over 8 weeks by fixing the feedback loop alone. They did not change their model, their embedding strategy, or their chunking. They simply started reviewing the failure cases weekly and adding the missing information to their knowledge base with proper chunking and metadata.

Here is a summary of the fixes, ranked by impact and ease of implementation:

1. Add grounding verification. Highest impact on answer quality, moderate implementation effort.
2. Implement semantic chunking with context headers. High impact, low effort.
3. Add a reranking step. High impact, very low effort if you use an API like Cohere's.
4. Implement query transformation. Moderate impact, low effort.
5. Add conversation-aware query construction. Moderate impact, low effort.
6. Structure chunk presentation with metadata. Moderate impact, low effort.
7. Build a feedback loop and review process. Highest long-term impact, moderate ongoing effort.

Most RAG chatbots do not suck because RAG is a bad architecture. They suck because teams skip the engineering work between "it works in a demo" and "it works in production." Each of the fixes above adds a small amount of complexity and cost. Together, they transform a frustrating chatbot into one that users actually trust.

About the Author

Fordel Studios

AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
