
Advanced RAG: Complete Guide 2026 – Beyond Basic Retrieval to Build Production-Grade Knowledge Bases

Solving the Three Core Problems: Hallucination, Inaccurate Retrieval, and Context Loss

Most RAG systems don't "fail to work" – they "don't work well enough": retrieving the wrong documents, missing key information in answers, or giving incomplete responses to complex questions.

This article explains how to solve these problems.

1. The Three Core Problems of RAG Systems

Problem 1: Inaccurate Retrieval (Low Recall/Precision)

Symptoms: The user asks a clear question, but the retrieved documents are irrelevant or miss the most important ones.

Root Cause: Limitations of pure vector similarity search

  • Keyword mismatch (user says "price increase," document says "raise selling price")
  • Similarity in vector space ≠ semantic relevance
  • Short queries lack sufficient semantic information

Problem 2: Insufficient Context Window (Context Stuffing)

Symptoms: Stuffing too many documents causes the LLM's attention to scatter, diluting key information.

Problem 3: Query-Document Mismatch

Symptoms: The user asks a complex multi-step question, but documents are chunked by single topics, so no single chunk can fully answer the question.


2. Hybrid Retrieval

Core solution to Problem 1: Combine vector retrieval and keyword retrieval.

python
    from langchain.retrievers import EnsembleRetriever
    from langchain_community.retrievers import BM25Retriever
    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    # `documents` is assumed to be a list of LangChain Document objects
    # that has already been loaded and chunked.

    # Vector retriever: index the documents so there is something to search
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(documents, embedding=embeddings)
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    # BM25 keyword retriever
    bm25_retriever = BM25Retriever.from_documents(documents)
    bm25_retriever.k = 5

    # Hybrid retrieval (RRF fusion)
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.4, 0.6],  # 40% keyword, 60% vector
    )

    results = ensemble_retriever.invoke("user query")

Why it works: BM25 excels at exact keyword matching, while vector retrieval handles semantic understanding, so the two complement each other.
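
To make the fusion step concrete, here is a minimal pure-Python sketch of weighted Reciprocal Rank Fusion. It illustrates the general technique and is not a copy of any library's code; the function name `weighted_rrf`, the document ids, and the smoothing constant `c=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    rankings: one ranked list (best first) per retriever
    weights:  one weight per retriever
    c:        smoothing constant (assumed default of 60)
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more when it ranks higher in a heavier-weighted list
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by document id for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings (hypothetical ids), using the same 0.4 / 0.6 weights as above
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise to the top, which is the behavior hybrid retrieval relies on.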

3. Reranking

After retrieving candidate documents, re-rank them with a more precise model:

python
    from sentence_transformers import CrossEncoder

    # Use a cross-encoder for reranking (more accurate than a bi-encoder)
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank_documents(query, documents, top_k=3):
        """Return the top_k documents ordered by cross-encoder relevance."""
        if not documents:
            return []
        # Score each (query, doc) pair
        pairs = [(query, doc.page_content) for doc in documents]
        scores = reranker.predict(pairs)
        # Re-rank by score, highest first
        ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

    # First retrieve broadly, then rerank strictly
    query = "user query"
    candidates = ensemble_retriever.invoke(query)  # up to 10 candidates (5 per retriever)
    top_docs = rerank_documents(query, candidates, top_k=3)  # keep 3

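Reranking typically improves accuracy by 15-30%, although the actual gain depends on the corpus, the queries, and the quality of the first-stage retriever.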
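
Because the gain varies, it helps to measure it on your own queries. The sketch below computes two common retrieval metrics, hit rate at k and mean reciprocal rank (MRR), before and after reranking. The function names and the document ids are made up for illustration, and the numbers come from toy data, not from a real benchmark.

```python
def hit_rate_at_k(ranked_per_query, relevant_per_query, k=3):
    """Fraction of queries whose relevant document appears in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_per_query, relevant_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_per_query)


def mean_reciprocal_rank(ranked_per_query, relevant_per_query):
    """Average of 1/rank of the relevant document (0 when it is missing)."""
    total = 0.0
    for ranked, relevant in zip(ranked_per_query, relevant_per_query):
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(relevant_per_query)


# Toy data: one known relevant document per query (hypothetical ids)
relevant = ["d1", "d7"]
before = [["d3", "d2", "d5", "d1"], ["d7", "d8", "d9", "d6"]]  # first-stage order
after = [["d1", "d3", "d2", "d5"], ["d7", "d9", "d8", "d6"]]   # order after reranking

print(hit_rate_at_k(before, relevant), hit_rate_at_k(after, relevant))
print(mean_reciprocal_rank(before, relevant), mean_reciprocal_rank(after, relevant))
```

Running the same comparison on a labeled sample of real queries shows whether the reranker earns its extra latency in your setup.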

4. Multi-Query Decomposition

For complex questions, automatically generate multiple sub-queries:

python
    from langchain.retrievers.multi_query import MultiQueryRetriever
    from langchain_openai import ChatOpenAI

    # Assumption: any LangChain chat model can be used as `llm`
    llm = ChatOpenAI()

    # Let the LLM automatically generate queries from multiple perspectives
    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=ensemble_retriever,
        llm=llm,
    )

    docs = multi_query_retriever.invoke("How to improve RAG system accuracy")

    # For the query "How to improve RAG system accuracy",
    # it automatically generates, for example:
    #   1. "RAG retrieval accuracy optimization methods"
    #   2. "Improving knowledge base QA quality"
    #   3. "Technical solutions to reduce RAG hallucination"
    # Then it merges the results from all three queries and deduplicates them.
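
The merge-and-deduplicate step can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the library's internals; `merge_unique`, `fake_retriever`, and the toy index contents are hypothetical stand-ins.

```python
def merge_unique(result_lists, key=lambda doc: doc):
    """Merge several result lists, keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            doc_key = key(doc)
            if doc_key not in seen:
                seen.add(doc_key)
                merged.append(doc)
    return merged


# Hypothetical stand-in for a real retriever: maps a query to document titles
TOY_INDEX = {
    "RAG retrieval accuracy optimization methods": ["hybrid search notes", "reranking guide"],
    "Improving knowledge base QA quality": ["reranking guide", "chunking tips"],
    "Technical solutions to reduce RAG hallucination": ["citation prompts", "hybrid search notes"],
}


def fake_retriever(query):
    return TOY_INDEX.get(query, [])


sub_queries = list(TOY_INDEX)
merged = merge_unique(fake_retriever(q) for q in sub_queries)
print(merged)
# ['hybrid search notes', 'reranking guide', 'chunking tips', 'citation prompts']
```

With real document objects, pass a `key` such as `lambda doc: doc.page_content` so that duplicates are detected by content rather than by object identity.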

5. Query Routing

Not every question needs to retrieve from the knowledge base:

python
    # `llm`, `retriever`, and `rag_chain` are assumed to be defined elsewhere

    def route_query(query):
        """Decide how to handle this query"""
        prompt = f"""Determine how this query should be handled:
    Query: {query}

    Options:
    - knowledge_base - needs internal document retrieval
    - direct_answer - general knowledge, answer directly
    - calculation - needs computation
    - clarification - needs clarification

    Return the option name only."""
        route = llm.invoke(prompt).content.strip()
        return route

    # Choose processing method based on routing result
    query = "What is our product's refund policy?"
    route = route_query(query)

    if route == "knowledge_base":
        docs = retriever.invoke(query)
        answer = rag_chain.invoke({"query": query, "docs": docs})
    elif route == "direct_answer":
        answer = llm.invoke(query).content
    else:
        # "calculation" and "clarification" need their own handlers,
        # which are not covered here
        answer = None
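
LLM replies do not always match an option name exactly (extra punctuation, capitalization, or a full sentence), so an exact string comparison can silently skip every branch. A minimal sketch of a defensive parser follows; `normalize_route`, `VALID_ROUTES`, and the choice of `knowledge_base` as the fallback are assumptions for illustration.

```python
VALID_ROUTES = ("knowledge_base", "direct_answer", "calculation", "clarification")


def normalize_route(raw_output, default="knowledge_base"):
    """Map a raw LLM reply to one of VALID_ROUTES, falling back to a default."""
    # Drop surrounding whitespace, case differences, and stray punctuation
    cleaned = raw_output.strip().lower().strip("\"'`.*- ")
    if cleaned in VALID_ROUTES:
        return cleaned
    # Accept a longer reply only if it mentions exactly one valid option
    mentioned = [route for route in VALID_ROUTES if route in cleaned]
    if len(mentioned) == 1:
        return mentioned[0]
    return default


print(normalize_route(" Knowledge_Base.\n"))                  # knowledge_base
print(normalize_route("The best option is direct_answer"))    # direct_answer
print(normalize_route("calculation or clarification"))        # knowledge_base (ambiguous)
```

Falling back to retrieval is a conservative choice: an unnecessary lookup usually costs less than answering an internal question from general knowledge.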

6. RAG Evaluation Framework

You can't judge RAG quality by subjective feeling alone; systematic evaluation is needed:

python
    # Use the RAGAS framework for evaluation
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,       # Faithfulness: whether the answer is based on the retrieved documents
        answer_relevancy,   # Relevancy: whether the answer addresses the question
        context_precision,  # Precision: whether the retrieved documents are relevant
        context_recall,     # Recall: whether the necessary documents were retrieved
    )

    # Build test set (20-50 Q&A pairs)
    # Column names follow the older RAGAS schema; newer releases may expect different names.
    test_dataset = Dataset.from_dict({
        "question": [...],
        "answer": [...],        # RAG system's answers
        "contexts": [...],      # Retrieved documents (a list of strings per question)
        "ground_truth": [...],  # Ground truth answers
    })

    result = evaluate(
        test_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    print(result)

Key Metrics (suggested targets):

  • Faithfulness > 0.85: Answers are grounded in the retrieved documents, with little hallucination
  • Answer Relevancy > 0.80: Answers are on-topic
  • Context Precision > 0.75: Retrieved documents are mostly relevant
  • Context Recall > 0.70: Few important documents are missed
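
These targets are easy to enforce automatically, for example as a check in a CI pipeline. The sketch below compares a dictionary of scores against the thresholds listed above; `failing_metrics` is a hypothetical helper, and the example scores are invented for illustration rather than measured.

```python
# Targets from the list above; a metric passes only when it exceeds its target
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}


def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return {metric: (score, target)} for metrics that miss their target.

    A metric that is absent from `scores` is reported with a score of None.
    """
    failures = {}
    for name, target in thresholds.items():
        score = scores.get(name)
        if score is None or score <= target:
            failures[name] = (score, target)
    return failures


# Invented example scores, not real evaluation results
example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.78,
    "context_precision": 0.80,
    "context_recall": 0.72,
}
failures = failing_metrics(example_scores)
print(failures)  # {'answer_relevancy': (0.78, 0.8)}
```

An empty result means every target was met; anything else points at the stage (retrieval or generation) that needs attention.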

Further Reading

  • RAG Knowledge Base Best Practices
  • Building Enterprise Knowledge Base with Dify
  • LangChain vs LangGraph Practical Guide