
RAG Knowledge Base Pitfall Guide: Full Analysis of Chunking Strategies, Embedding Models, and Retrieval Tuning

A systematic approach to fixing RAG's 'irrelevant answers', 'missing information', and 'hallucinations'

Direct Answer

Top 3 reasons for poor RAG performance (ranked by impact):

  • Wrong chunking strategy: Cutting off coherent logic, leaving retrieved results lacking key context (accounts for 45% of issues)
  • Mismatched embedding model: Using an English model for Chinese, causing inaccurate semantic similarity (accounts for 30%)
  • Untuned retrieval parameters: Default Top K=5 + threshold 0.7 is unsuitable for most scenarios (accounts for 25%)

    Complete RAG System Architecture

    
    [Document Preprocessing] → Cleaning → Chunking → Metadata Annotation
          ↓
    [Vectorization] → Embedding Model → Vector Store (Chroma/Qdrant/Weaviate)
          ↓
    [Retrieval] → Vector Search + BM25 → Reranker Re-ranking
          ↓
    [Generation] → Context Assembly → LLM → Final Answer
    

    Every step can introduce errors, requiring layer-by-layer optimization.


    Layer 1: Document Preprocessing

    Most overlooked pitfall in RAG: 60% of retrieval failures happen during document preprocessing, not in the retrieval algorithm.

    python
    import re

    def clean_document(text: str) -> str:
        lines = text.split("\n")
        # Filter out short lines like page numbers, repeated headers (<10 chars)
        lines = [l for l in lines if len(l.strip()) > 10 or l.strip() == ""]
        text = "\n".join(lines)
        # Merge sentences broken by line breaks during PDF extraction
        text = re.sub(r"(?
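
The merge pattern in the snippet above is incomplete in the source, so here is a minimal, self-contained sketch of the same cleaning step. The regex is an assumption, not the original: it joins a line break only when the previous character is not sentence-ending punctuation and the break is not part of a blank-line paragraph boundary. `SENTENCE_END` and `min_len` are hypothetical names introduced for illustration.

```python
import re

# Assumed set of sentence-ending characters (English and Chinese punctuation).
SENTENCE_END = ".!?:;。！？：；"


def clean_document(text: str, min_len: int = 10) -> str:
    lines = text.split("\n")
    # Drop short lines such as page numbers and repeated headers, keep blank lines
    lines = [l for l in lines if len(l.strip()) > min_len or l.strip() == ""]
    text = "\n".join(lines)
    # Assumed rule: a single newline that does not follow sentence-ending
    # punctuation (or another newline) is a PDF line wrap, so join it with a space.
    pattern = rf"(?<![{re.escape(SENTENCE_END)}\n])\n(?!\n)"
    return re.sub(pattern, " ", text)
```

For Chinese text you would likely join with an empty string instead of a space; that choice depends on the corpus.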


    Layer 2: Chunking Strategy

    Strategy | Suitable Scenarios | Code Solution
    Fixed size | FAQs, simple documentation | RecursiveCharacterTextSplitter
    Semantic chunking | Long reports, technical docs (recommended) | SemanticChunker
    Hierarchical chunking (parent-child) | Structured manuals, multi-level directories | ParentDocumentRetriever
    Header-based chunking | Markdown documents | MarkdownHeaderTextSplitter
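
To make the fixed-size strategy concrete, here is a library-free sketch of character windows with overlap. It is only an illustration: `fixed_size_chunks` is a hypothetical helper, and RecursiveCharacterTextSplitter additionally tries to split on separators such as paragraphs and sentences before falling back to raw character counts.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap is what keeps a sentence that straddles a boundary visible in both neighbouring chunks, which is the "cut-off logic" failure described at the top of this guide.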

    Semantic chunking example (best performance):

    python
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai import OpenAIEmbeddings

    splitter = SemanticChunker(
        OpenAIEmbeddings(),
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=90,
    )
    chunks = splitter.create_documents([document])
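
The idea behind a percentile breakpoint can be sketched without any library: measure the embedding distance between each pair of adjacent sentences, then start a new chunk wherever the distance exceeds the chosen percentile of all distances. This is a simplified illustration under stated assumptions, not SemanticChunker's actual code. The distances are taken as given (in practice they would come from an embedding model), and `percentile` and `split_at_breakpoints` are hypothetical helper names.

```python
def percentile(values: list[float], q: float) -> float:
    """Linear-interpolation percentile (q in 0-100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def split_at_breakpoints(sentences: list[str], distances: list[float], q: float = 90.0) -> list[str]:
    """distances[i] is the embedding distance between sentences[i] and sentences[i + 1]."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            # A large semantic jump: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A higher percentile yields fewer, larger chunks; a lower one splits more aggressively.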


    Layer 3: Embedding Model Selection

    Model | Chinese Performance | Price | Recommended Scenario
    BGE-M3 (Open Source) | ⭐⭐⭐⭐⭐ | Free | Best for Chinese, can run locally
    OpenAI text-embedding-3-large | ⭐⭐⭐⭐ | $0.13/1M tokens | Best for English/multilingual
    Jina-embeddings-v3 | ⭐⭐⭐⭐ | Low cost | Balanced for Chinese and English

    python
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")
    embeddings = model.encode(texts, batch_size=32, normalize_embeddings=True)
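
Why `normalize_embeddings=True` matters: once vectors have unit length, a plain dot product equals cosine similarity, so the vector store can use the cheaper inner-product metric. The toy vectors below are made up for illustration, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# Dot product of the normalized vectors matches cosine similarity of the originals
same = math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```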


    Layer 4: Hybrid Retrieval

    Why isn't pure vector retrieval enough?

    
    User asks: "What was the refund amount for Q3 2024?"

    Pure vector retrieval → Finds "Refund Policy Description" (semantically similar) but no specific numbers ❌
    BM25 retrieval → Exact match "Q3 2024" and "refund amount" ✅
    Hybrid retrieval → Vector finds relevant section + BM25 pinpoints the number ✅✅

    python
    from langchain.retrievers import EnsembleRetriever

    # Hybrid retrieval: vector 60% + BM25 40%
    ensemble = EnsembleRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        weights=[0.6, 0.4],
    )
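
How the weights combine two ranked lists can be sketched with weighted Reciprocal Rank Fusion, which LangChain documents as the method behind EnsembleRetriever. The version below is a simplified illustration, not the library's code: `weighted_rrf` is a hypothetical helper, the document IDs are made up, and the constant `c=60` is assumed as the usual default.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each document earns weight / (rank + c) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["A", "B", "C"]  # hypothetical vector-search order
bm25_hits = ["C", "A", "D"]    # hypothetical BM25 order
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
```

Because fusion works on ranks rather than raw scores, the incompatible scales of cosine similarity and BM25 never need to be reconciled.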


    Layer 5: Reranker Re-ranking (Single Most Important Optimization)

    Improves accuracy by 15-25%: first recall 20 candidates, then let the reranker select the top 4.

    python
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

    def rerank(query, docs, top_n=4):
        # Score every (query, document) pair with the cross-encoder
        pairs = [[query, doc] for doc in docs]
        scores = reranker.predict(pairs)
        # Sort by score only, highest first
        ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in ranked[:top_n]]

    candidates = ensemble.invoke(query)  # recall 20 results
    top_docs = rerank(query, [c.page_content for c in candidates])  # select top 4


    System Evaluation: RAGAS Framework

    python
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

    result = evaluate(
        dataset=test_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    # Each metric ranges from 0 to 1: 0.8+ is acceptable, 0.9+ is excellent
    print(result)
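
The 0.8 and 0.9 cut-offs can be turned into a simple pass/fail report. The sketch below is independent of RAGAS: it assumes you have already extracted the metric scores into a plain dictionary, and `grade_metrics` along with the example scores are hypothetical.

```python
def grade_metrics(scores: dict[str, float]) -> dict[str, str]:
    """Label each 0-1 metric using the thresholds from this guide."""
    grades = {}
    for name, value in scores.items():
        if value >= 0.9:
            grades[name] = "excellent"
        elif value >= 0.8:
            grades[name] = "acceptable"
        else:
            grades[name] = "needs work"
    return grades


# Made-up scores for illustration only
report = grade_metrics({"faithfulness": 0.92, "answer_relevancy": 0.85, "context_recall": 0.6})
```

A low context_recall with high faithfulness usually points back to chunking or retrieval rather than to the generation step.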


    FAQ

    Q: When to use RAG vs. Fine-tuning?

    A: Frequently updated knowledge → RAG; stable knowledge requiring specific style/format → Fine-tuning; both can be combined (fine-tuned model + RAG knowledge base).

    Q: Why does vector search sometimes return irrelevant content?

    A: Vector search always returns the nearest neighbors, even when nothing in the knowledge base is truly relevant; any result below a similarity threshold is a "forced match." Set a score_threshold (0.5-0.6) and explicitly state in the prompt: "If the retrieved information is insufficient to answer, say you don't know."
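
A minimal sketch of that advice, assuming the retriever returns (text, similarity) pairs where a higher score means more relevant. Some vector stores return distances instead, in which case the comparison flips. `filter_by_score`, `build_prompt`, and the sample results are hypothetical.

```python
FALLBACK_INSTRUCTION = (
    "If the retrieved information is insufficient to answer, say you don't know."
)


def filter_by_score(results: list[tuple[str, float]], threshold: float = 0.55) -> list[str]:
    """Keep only passages whose similarity reaches the threshold."""
    return [text for text, score in results if score >= threshold]


def build_prompt(question: str, results: list[tuple[str, float]], threshold: float = 0.55) -> str:
    """Assemble a prompt from the surviving passages plus the fallback instruction."""
    context = filter_by_score(results, threshold)
    context_block = "\n\n".join(context) if context else "(no relevant passages found)"
    return f"{FALLBACK_INSTRUCTION}\n\nContext:\n{context_block}\n\nQuestion: {question}"


# Made-up retrieval results for illustration
hits = [("Refunds take 7 days.", 0.82), ("Company history.", 0.31)]
prompt = build_prompt("How long do refunds take?", hits)
```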

    Q: How to handle charts in documents?

    A: 1) Use GPT-4V to convert charts to text descriptions before storing; 2) Generate natural language descriptions for charts separately (e.g., "Figure 3 shows quarterly sales for 2024, with Q3 highest at 12 million").


    Related Resources

  • Dify Knowledge Base Setup: aiskillnav.com/tutorials/dify-enterprise-knowledge-base
  • Vector Database Selection: aiskillnav.com/tutorials/vector-database-comparison-pinecone-weaviate-chroma-2026
  • AI Agent Tool Directory: aiskillnav.com/agents