RAG Beyond the Basics: What Nobody Tells You

Building a RAG (Retrieval-Augmented Generation) pipeline is easy. You chunk your documents, embed them, store them in a vector database, retrieve the top-k on each query, and stuff them into a prompt. In an hour you have something that works.

The problem is that "works" is doing a lot of heavy lifting in that sentence. It works sometimes. It works on clean, well-formatted documents. It works when the user asks exactly the right question.

Here's what I've learned building RAG systems that work reliably.

Chunking Is Not Trivial

Most tutorials chunk by fixed token count. This is simple and often wrong. A 512-token chunk might split a code example in half, cut off a table at an awkward boundary, or merge two conceptually unrelated sections.

Better approaches:

Semantic chunking: Split on meaning, not token count. Detect paragraph boundaries, section headers, and conceptual transitions.
Hierarchical chunking: Store both summary-level and detail-level chunks. Retrieve at the summary level first, then expand.
Document-aware chunking: Treat code differently from prose. Treat structured data (tables, lists) differently from narrative text.
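
As a rough illustration of the simplest piece of this, here is a minimal sketch of boundary-aware chunking: it packs whole paragraphs into chunks and never cuts one in the middle. The function name chunk_by_paragraph, the max_words parameter, and the use of word count as a stand-in for token count are all assumptions made for the example. It does not detect headers, code, or tables.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words becomes its own oversized chunk
    rather than being cut in the middle. Word count is only a rough
    proxy for token count (assumption for this sketch).
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if this paragraph would exceed the budget.
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real pipeline would likely swap the blank-line split for a format-aware parser (Markdown headers, code fences, table rows), but the packing loop stays much the same.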

The chunking strategy should match your retrieval strategy. If you're doing semantic search, you want semantic chunks.

The Retrieval Gap

Vector similarity finds semantically similar text—not relevant text. These aren't the same thing.

"What is the refund policy?" and "I want to return a product" are semantically similar but ask about different policies. "Interest rate" and "interest" have high semantic similarity but could match completely different financial concepts.

Hybrid search (dense + sparse) closes much of this gap. Sparse methods such as BM25 or TF-IDF catch exact keyword matches that embedding search misses. Running both and merging the results almost always beats either alone.
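
One common way to merge the two result lists is reciprocal rank fusion (RRF), which only needs the rank positions and so avoids comparing BM25 scores with cosine similarities directly. The sketch below is one option, not the only one. The function name, the document IDs, and the choice of k = 60 (a frequently used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Higher total score ranks first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a dense retriever and a sparse (BM25) retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
merged = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both lists ("a" and "b" here) rise to the top, which is usually the behavior you want from a hybrid setup.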

Re-ranking is the other key piece. After retrieval, run a cross-encoder (or an LLM) to score each retrieved document against the actual query. The retrieval phase optimizes for recall; re-ranking optimizes for precision.
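
The retrieve-then-re-rank shape can be sketched independently of any particular model. In the example below, rerank takes a scoring function as a parameter; in a real system that function would wrap a cross-encoder or an LLM call. The overlap_score function is a deliberately crude placeholder so the sketch runs on its own, and both names are assumptions for illustration.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score every candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "shipping times vary",
    "our refund policy lasts 30 days",
    "refund requests",
]
best = rerank("refund policy", candidates, overlap_score, top_n=2)
```

The point of the design is that retrieval can be generous (say, 50 candidates) because the more expensive scorer only ever sees that short list.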

Context Window Management

Most RAG implementations dump all retrieved chunks into the context window without much thought. This causes two problems:

Lost in the middle: Models attend more to content at the beginning and end of the context. Relevant information buried in the middle gets underweighted.
Conflicting information: If you retrieve chunks from different time periods or sources, the model might blend them into one answer or pick the wrong one.

Fix: Curate, don't concatenate. Pass the top 3 chunks, ordered by relevance, with clear separators. Include source metadata. If chunks conflict, surface that to the model explicitly.
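
Here is a minimal sketch of what "curate, don't concatenate" might look like in code. The Chunk dataclass, its field names, the header format, and the "---" separator are all assumptions for illustration; use whatever your prompt template expects.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    source: str  # e.g. a file name or URL, plus a date if you have one
    text: str
    score: float  # relevance score from retrieval or re-ranking


def build_context(chunks: list[Chunk], max_chunks: int = 3) -> str:
    """Keep the most relevant chunks, best first, each labeled with its source."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    blocks = [
        f"[chunk_id={c.chunk_id} source={c.source}]\n{c.text}" for c in top
    ]
    # A visible separator makes chunk boundaries unambiguous to the model.
    return "\n\n---\n\n".join(blocks)


retrieved = [
    Chunk("c1", "policy-2023.md", "Refunds within 14 days.", 0.62),
    Chunk("c2", "policy-2024.md", "Refunds within 30 days.", 0.91),
    Chunk("c3", "faq.md", "Contact support to start a refund.", 0.75),
    Chunk("c4", "blog.md", "We love our customers.", 0.10),
]
context = build_context(retrieved)
```

Because each block carries its chunk ID and source, the prompt can also tell the model which source to prefer when two chunks disagree, as c1 and c2 do here.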

Citation and Grounding

"Hallucination" in RAG usually isn't random—it's the model filling gaps between retrieved documents with plausible-sounding inference. The fix isn't better embeddings; it's better prompting.

Require the model to cite every claim with a chunk reference. Use structured output: {"claim": "...", "source_chunk_id": "..."}. If the model can't cite it, it shouldn't say it.
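
Requiring citations only helps if you check them. The sketch below assumes the model returns a JSON array of objects in the shape shown above, and it separates claims whose source_chunk_id matches a chunk that was actually retrieved from claims that cite nothing or cite an unknown ID. The function name and the array wrapper are assumptions. Note that this only verifies that the citation exists, not that the chunk supports the claim.

```python
import json


def filter_cited_claims(
    model_output: str, retrieved_ids: set[str]
) -> tuple[list[dict], list[dict]]:
    """Split claims into (supported, unsupported) by citation validity.

    model_output is assumed to be a JSON array of
    {"claim": ..., "source_chunk_id": ...} objects.
    """
    claims = json.loads(model_output)
    supported: list[dict] = []
    unsupported: list[dict] = []
    for item in claims:
        # A missing key or an ID we never retrieved both count as unsupported.
        if item.get("source_chunk_id") in retrieved_ids:
            supported.append(item)
        else:
            unsupported.append(item)
    return supported, unsupported


raw = json.dumps([
    {"claim": "Refunds are allowed within 30 days.", "source_chunk_id": "c2"},
    {"claim": "Refunds are instant.", "source_chunk_id": "c9"},
    {"claim": "Support is available 24/7."},
])
ok, rejected = filter_cited_claims(raw, retrieved_ids={"c1", "c2", "c3"})
```

What to do with the rejected claims is a product decision: drop them, flag them, or send the answer back for another attempt.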

Evaluation Is the Hard Part

You can't improve a RAG system without measuring it. The metrics that matter:

Retrieval recall: Are the relevant chunks being retrieved at all?
Context precision: Are irrelevant chunks contaminating the context?
Answer faithfulness: Does the answer only use information from the retrieved chunks?
Answer relevance: Does the answer actually address the question?

Build a golden dataset of (question, expected answer, relevant chunks) tuples. Run it every time you change your chunking, embedding model, or retrieval strategy. Without this, you're flying blind.
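
The two retrieval metrics are simple to compute once the golden dataset exists. The sketch below assumes each golden example is a dict with a "question" and a set of "relevant" chunk IDs, and that retrieve is whatever function your pipeline exposes for returning chunk IDs; all of these names are hypothetical. Context precision is computed here as a plain fraction, which is a simplification (some evaluation frameworks use a rank-weighted variant). Faithfulness and answer relevance need a judge, human or model, and are left out.

```python
from typing import Callable


def retrieval_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved chunks that are relevant (unweighted)."""
    if not retrieved:
        return 0.0
    return sum(1 for r in retrieved if r in relevant) / len(retrieved)


def evaluate(
    golden: list[dict], retrieve: Callable[[str], list[str]]
) -> dict[str, float]:
    """Average both metrics over a non-empty golden dataset."""
    recalls, precisions = [], []
    for example in golden:
        hits = retrieve(example["question"])
        recalls.append(retrieval_recall(hits, example["relevant"]))
        precisions.append(context_precision(hits, example["relevant"]))
    n = len(golden)
    return {"recall": sum(recalls) / n, "precision": sum(precisions) / n}


# Hypothetical golden set and a canned retriever standing in for the pipeline.
golden = [
    {"question": "q1", "relevant": {"a", "b"}},
    {"question": "q2", "relevant": {"c"}},
]
canned = {"q1": ["a", "x"], "q2": ["c"]}
report = evaluate(golden, lambda q: canned[q])
```

Running this on every change gives you a number to compare before and after, which is the whole point of the golden dataset.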

RAG is a solvable problem—but it's an engineering problem, not a demo problem. Treat it like one.