Advanced RAG: Complete Guide 2026 – Beyond Basic Retrieval to Build Production-Grade Knowledge Bases
Solving the Three Core Problems: Hallucination, Inaccurate Retrieval, and Context Loss
Most RAG systems don't "fail to work" – they "don't work well enough": retrieving the wrong documents, missing key information in answers, or giving incomplete responses to complex questions.
This article explains how to solve these problems.
1. The Three Core Problems of RAG Systems
Problem 1: Inaccurate Retrieval (Low Recall/Precision)
Symptoms: The user asks a clear question, but the retrieved documents are irrelevant or miss the most important ones.
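Root Cause: Pure vector similarity search ranks by semantic closeness, so it can miss documents that hinge on exact terms such as product names, error codes, or identifiers.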
Problem 2: Insufficient Context Window (Context Stuffing)
Symptoms: Stuffing too many documents causes the LLM's attention to scatter, diluting key information.
Problem 3: Query-Document Mismatch
Symptoms: The user asks a complex multi-step question, but documents are chunked by single topics, so no single chunk can fully answer the question.
2. Hybrid Retrieval
Core solution to Problem 1: Combine vector retrieval and keyword retrieval.
python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# `documents` is your list of pre-chunked LangChain Document objects

# Vector retriever (index the documents so the store is not empty)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Hybrid retrieval (RRF fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # 40% keyword, 60% vector
)

results = ensemble_retriever.invoke("user query")
Why it works: BM25 excels at exact keyword matching, while vector retrieval handles semantic understanding – they complement each other.
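To make the fusion step concrete, here is a minimal pure-Python sketch of weighted Reciprocal Rank Fusion. It is an illustration, not LangChain's actual implementation: the function name `weighted_rrf`, the smoothing constant `c=60`, and the document IDs are all assumptions chosen for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: one ranked list (best first) per retriever.
    weights:  one weight per ranked list.
    c:        smoothing constant (60 is an assumed, commonly used value).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes more for documents it ranks higher.
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword hits
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical semantic hits
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

In this toy input, `doc_b` comes out first because both retrievers place it near the top, while documents found by only one retriever fall lower in the fused list.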
3. Reranking
After retrieving candidate documents, re-rank them with a more refined model:
python
from sentence_transformers import CrossEncoder

# Use Cross-Encoder for reranking (more accurate than bi-encoder)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_documents(query, documents, top_k=3):
    # Score each (query, doc) pair
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)
    # Re-rank by score
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, _ in ranked[:top_k]]

# First retrieve broadly, then rerank strictly
candidates = ensemble_retriever.invoke(query)  # retrieve 10
top_docs = rerank_documents(query, candidates, top_k=3)  # keep 3
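Reranking typically improves accuracy by 15-30%, though the actual gain depends on your data and on how strong the first-stage retriever already is, so measure it on your own test set.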
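The retrieve-broadly-then-rerank pattern can be tried offline before wiring in a real model. The sketch below swaps the cross-encoder for a hypothetical stand-in scorer (`overlap_score`, simple term overlap), so it only illustrates the shape of the pipeline and says nothing about real reranking quality; all names and sample texts are assumptions.

```python
def overlap_score(query, text):
    """Hypothetical stand-in scorer: fraction of query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)

def rerank(query, texts, score_fn, top_k=3):
    """Score every (query, text) pair, then keep the top_k highest-scoring texts."""
    scored = [(score_fn(query, text), text) for text in texts]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

# Hypothetical first-stage candidates, in retrieval order.
candidates = [
    "shipping times for international orders",
    "refund policy for digital products",
    "how to request a refund for an order",
    "company holiday schedule",
]
top = rerank("refund policy", candidates, overlap_score, top_k=2)
print(top)
```

Passing the scorer as a parameter keeps the reranking logic independent of the model, so a cross-encoder's scoring call can replace `overlap_score` later without touching `rerank`.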
4. Multi-Query Decomposition
For complex questions, automatically generate multiple sub-queries:
python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Let LLM automatically generate queries from multiple perspectives
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=ensemble_retriever,
    llm=llm
)

# For the query "How to improve RAG system accuracy"
# It automatically generates:
# 1. "RAG retrieval accuracy optimization methods"
# 2. "Improving knowledge base QA quality"
# 3. "Technical solutions to reduce RAG hallucination"
# Then merges results from all three queries, deduplicating
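The merge-and-deduplicate step can be sketched in plain Python. This is an illustration of the idea rather than LangChain's own code; the `merge_unique` helper and the dictionary shape with an `id` key are assumptions made for the example.

```python
def merge_unique(result_lists):
    """Merge per-query result lists, keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            key = doc["id"]  # assumed unique identifier per chunk
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged

# Hypothetical results for three generated queries; some chunks repeat.
results_per_query = [
    [{"id": "c1", "text": "hybrid retrieval"}, {"id": "c2", "text": "reranking"}],
    [{"id": "c2", "text": "reranking"}, {"id": "c3", "text": "evaluation"}],
    [{"id": "c1", "text": "hybrid retrieval"}, {"id": "c4", "text": "query routing"}],
]
merged = merge_unique(results_per_query)
print([doc["id"] for doc in merged])
```

Because the merged list can be much longer than a single query's results, it is a natural input for a reranking step before anything is passed to the LLM.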
5. Query Routing
Not every question requires retrieval from the knowledge base:
python
def route_query(query):
    """Decide how to handle this query"""
    prompt = f"""Determine how this query should be handled:
Query: {query}

Options:
knowledge_base - needs internal document retrieval
direct_answer - general knowledge, answer directly
calculation - needs computation
clarification - needs clarification

Return the option name only."""
    route = llm.invoke(prompt).content.strip()
    return route

# Choose processing method based on routing result
query = "What is our product's refund policy?"
route = route_query(query)

if route == "knowledge_base":
    docs = retriever.invoke(query)
    answer = rag_chain.invoke({"query": query, "docs": docs})
elif route == "direct_answer":
    answer = llm.invoke(query)
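An LLM does not always return the bare option name, so the routing result is worth validating before branching on it. Below is a minimal sketch of such a guard. The `normalize_route` helper is hypothetical, and falling back to `knowledge_base` is an assumed policy (retrieval is treated here as the safe default); adjust both to your system.

```python
ALLOWED_ROUTES = {"knowledge_base", "direct_answer", "calculation", "clarification"}

def normalize_route(raw, default="knowledge_base"):
    """Map raw model output to one of the allowed route names, else the default."""
    # Remove whitespace, then surrounding quotes, backticks, and periods.
    cleaned = raw.strip().strip("\"'`.").lower()
    if cleaned in ALLOWED_ROUTES:
        return cleaned
    # Accept replies such as "Route: calculation" if exactly one option appears.
    matches = [route for route in ALLOWED_ROUTES if route in cleaned]
    if len(matches) == 1:
        return matches[0]
    # Ambiguous or unrecognized output: use the assumed safe default.
    return default

print(normalize_route("  Knowledge_Base.\n"))
print(normalize_route("Route: calculation"))
print(normalize_route("I am not sure"))
```

Wrapping the model's reply this way means the `if`/`elif` chain always receives one of the four known values and never silently skips every branch.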
6. RAG Evaluation Framework
You can't judge RAG quality by subjective feeling alone; systematic evaluation is needed:
python
# Use RAGAS framework for evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Faithfulness: whether answer is based on retrieved documents
    answer_relevancy,   # Relevancy: whether answer addresses the question
    context_precision,  # Precision: whether retrieved documents are relevant
    context_recall      # Recall: whether necessary documents were retrieved
)

# Build test set (20-50 Q&A pairs)
# evaluate() expects a Hugging Face Dataset rather than a plain dict
test_dataset = Dataset.from_dict({
    "question": [...],
    "answer": [...],        # RAG system's answer
    "contexts": [...],      # Retrieved documents (a list of strings per question)
    "ground_truth": [...]   # Ground truth answers
})

result = evaluate(test_dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
])
print(result)
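Key Metrics:
Faithfulness: whether the answer is grounded in the retrieved documents.
Answer relevancy: whether the answer addresses the question.
Context precision: whether the retrieved documents are relevant.
Context recall: whether the necessary documents were retrieved.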
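For intuition about the two retrieval-side metrics, precision and recall can be computed by hand when you know which documents are relevant to a question. The sketch below is a deliberately simplified, ID-based version. RAGAS computes its own metrics differently (they are model-based), so treat this only as a mental model; the function name and document IDs are assumptions.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) of retrieved document IDs against a relevant set."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    # Precision: share of retrieved documents that are actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were actually retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = context_precision_recall(
    retrieved_ids=["d1", "d2", "d3", "d4"],  # hypothetical retriever output
    relevant_ids=["d2", "d4", "d7"],         # hypothetical labeled relevant docs
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision points to noisy retrieval that a reranker can help with, while low recall suggests the retriever never surfaces the needed documents, which hybrid retrieval or multi-query expansion is meant to address.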