RAG Knowledge Base Pitfall Guide: Full Analysis of Chunking Strategies, Embedding Models, and Retrieval Tuning

A systematic approach to fixing RAG's 'irrelevant answers', 'missing information', and 'hallucinations'

Direct Answer

Top 3 reasons for poor RAG performance (ranked by impact):

Wrong chunking strategy: Cutting off coherent logic, leaving retrieved results lacking key context (accounts for 45% of issues)

Mismatched embedding model: Using an English model for Chinese, causing inaccurate semantic similarity (accounts for 30%)

Untuned retrieval parameters: Default Top K=5 + threshold 0.7 is unsuitable for most scenarios (accounts for 25%)

Complete RAG System Architecture


[Document Preprocessing] → Cleaning → Chunking → Metadata Annotation
      ↓
[Vectorization] → Embedding Model → Vector Store (Chroma/Qdrant/Weaviate)
      ↓
[Retrieval] → Vector Search + BM25 → Reranker Re-ranking
      ↓
[Generation] → Context Assembly → LLM → Final Answer

Every step can introduce errors, requiring layer-by-layer optimization.

Layer 1: Document Preprocessing

Most overlooked pitfall in RAG: 60% of retrieval failures happen during document preprocessing, not in the retrieval algorithm.

python
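import re

def clean_document(text: str) -> str:
    lines = text.split("\n")
    # Drop short lines (10 characters or fewer) such as page numbers and
    # repeated headers; keep blank lines so paragraph breaks survive
    lines = [l for l in lines if len(l.strip()) > 10 or l.strip() == ""]
    text = "\n".join(lines)
    # Merge sentences broken by line breaks during PDF extraction.
    # The pattern is cut off in the source; this one is an assumed stand-in:
    # join a single newline that does not follow sentence-ending punctuation
    text = re.sub(r"(?<![.!?。！？:：\n])\n(?!\n)", " ", text)
    return text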

Layer 2: Chunking Strategy

| Strategy | Suitable Scenarios | Code Solution |
| --- | --- | --- |
| Fixed size | FAQs, simple documentation | RecursiveCharacterTextSplitter |
| Semantic chunking | Long reports, technical docs (recommended) | SemanticChunker |
| Hierarchical chunking (parent-child) | Structured manuals, multi-level directories | ParentDocumentRetriever |
| Header-based chunking | Markdown documents | MarkdownHeaderTextSplitter |
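
To make the fixed-size strategy concrete, here is a minimal pure-Python sketch of character windows with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter first tries to break on separators such as paragraphs and sentences); the function name `chunk_fixed` and the sizes are illustrative assumptions.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "0123456789" * 5  # 50 characters of toy input
pieces = chunk_fixed(sample, chunk_size=20, overlap=5)
```

The overlap means the tail of one chunk is repeated at the head of the next, which reduces the chance that a sentence cut at a boundary loses its context entirely.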

Semantic chunking example (best performance):

python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)
chunks = splitter.create_documents([document])
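
The percentile setting can be hard to picture, so here is a rough pure-Python sketch of the idea: measure the cosine distance between neighbouring sentence embeddings and start a new chunk wherever that distance exceeds the chosen percentile of all distances. This is a simplified illustration, not SemanticChunker's actual code; the helper names and the toy 2-D vectors are assumptions made for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_by_percentile(sentences, vectors, pct=90.0):
    """Start a new chunk wherever the jump between neighbours exceeds the percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Toy 2-D "embeddings": three sentences about refunds, then two about shipping
sentences = ["Refunds take 5 days.", "Refunds need a receipt.", "Refunds go to the card.",
             "Shipping is free.", "Shipping takes 3 days."]
vectors = [[1.0, 0.0], [0.98, 0.2], [0.95, 0.3], [0.1, 1.0], [0.0, 1.0]]
semantic_chunks = split_by_percentile(sentences, vectors, pct=90.0)
```

In this toy input the only large jump is between the refund sentences and the shipping sentences, so the text splits into two chunks at that point. A lower percentile produces more, smaller chunks.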

Layer 3: Embedding Model Selection

| Model | Chinese Performance | Price | Recommended Scenario |
| --- | --- | --- | --- |
| BGE-M3 (Open Source) | ⭐⭐⭐⭐⭐ | Free | Best for Chinese, can run locally |
| OpenAI text-embedding-3-large | ⭐⭐⭐⭐ | $0.13 / 1M tokens | Best for English/multilingual |
| Jina-embeddings-v3 | ⭐⭐⭐⭐ | Low cost | Balanced for Chinese and English |

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
embeddings = model.encode(texts, batch_size=32, normalize_embeddings=True)
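
One detail worth spelling out is `normalize_embeddings=True`: it scales every vector to unit length, so cosine similarity becomes a plain dot product, which is what many vector stores compute fastest. The sketch below checks that equivalence in pure Python; the two vectors are made-up numbers, not real model output.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [3.0, 4.0]  # made-up vectors for illustration
doc_vec = [4.0, 3.0]
q_unit, d_unit = l2_normalize(query_vec), l2_normalize(doc_vec)

# After normalization the dot product equals the cosine similarity
same = math.isclose(dot(q_unit, d_unit), cosine(query_vec, doc_vec))
```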

Layer 4: Hybrid Retrieval

Why isn't pure vector retrieval enough?


User asks: "What was the refund amount for Q3 2024?"
Pure vector retrieval → Finds "Refund Policy Description" (semantically similar) but no specific numbers ❌
BM25 retrieval   → Exact match "Q3 2024" and "refund amount" ✅
Hybrid retrieval → Vector finds relevant section + BM25 pinpoints the number ✅✅
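
To show why BM25 catches the exact terms, here is a minimal sketch of Okapi BM25 scoring in pure Python. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three sample documents are assumptions for illustration; Chinese text would need a word segmenter instead of `split()`.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # only exact term matches contribute
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores


# Made-up documents for illustration only
docs = [
    "Refund policy: customers may request a refund within 30 days",
    "Q3 2024 report: total refund amount was recorded by finance",
    "Shipping policy and delivery times",
]
scores = bm25_scores("refund amount Q3 2024", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The policy document only matches "refund", while the report matches all four query terms, so the report scores highest. A document with no shared terms scores exactly zero, which is the behaviour pure vector search cannot give you.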

python
from langchain.retrievers import EnsembleRetriever
# Hybrid retrieval: vector 60% + BM25 40%
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
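
The weights do not average raw similarity scores, which live on different scales for vector search and BM25. To my understanding EnsembleRetriever combines the ranked lists with weighted Reciprocal Rank Fusion; treat that as an assumption and check it against your installed version. The sketch below shows the fusion idea in pure Python, with a smoothing constant `c=60` and hypothetical document IDs.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs; a higher fused score means a better rank."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank) for every document it returns
            fused[doc_id] += weight / (c + rank)
    return sorted(fused, key=fused.get, reverse=True)


vector_hits = ["policy", "q3_report", "faq"]      # hypothetical IDs from vector search
bm25_hits = ["q3_report", "invoice", "policy"]    # hypothetical IDs from BM25
fused_order = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
```

Here "q3_report" ends up first because it ranks well in both lists, even though vector search alone preferred "policy".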

Layer 5: Reranker Re-ranking (Single Most Important Optimization)

Reranking typically improves accuracy by 15-25%: recall 20 candidates first, then let the reranker select the top 4.

python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query, docs, top_n=4):
    # Score each (query, document) pair with the cross-encoder
    pairs = [[query, doc] for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score only, so ties never fall back to comparing documents
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

candidates = ensemble.invoke(query)   # recall 20 results
top_docs = rerank(query, [c.page_content for c in candidates])  # select top 4

System Evaluation: RAGAS Framework

python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
result = evaluate(dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
# Each metric 0-1: 0.8+ acceptable, 0.9+ excellent
print(result)
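
To act on those thresholds, a small helper can turn raw scores into grades. The sketch below works on a plain dictionary of metric names and values; the function name `grade_metrics` and the example scores are made up for illustration, and how you extract a dictionary from the RAGAS result depends on the library version.

```python
def grade_metrics(scores: dict[str, float],
                  acceptable: float = 0.8,
                  excellent: float = 0.9) -> dict[str, str]:
    """Map each 0-1 metric to a grade using the thresholds from this guide."""
    grades = {}
    for name, value in scores.items():
        if value >= excellent:
            grades[name] = "excellent"
        elif value >= acceptable:
            grades[name] = "acceptable"
        else:
            grades[name] = "needs work"
    return grades


# Hypothetical scores, not measured results
example_scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.85,
    "context_precision": 0.71,
    "context_recall": 0.80,
}
grades = grade_metrics(example_scores)
```

As a rule of thumb, weak context_precision or context_recall points back to chunking and retrieval, while weak faithfulness points to the generation step.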

FAQ

Q: When to use RAG vs. Fine-tuning?

A: Frequently updated knowledge → RAG; stable knowledge requiring specific style/format → Fine-tuning; both can be combined (fine-tuned model + RAG knowledge base).

Q: Why does vector search sometimes return irrelevant content?

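A: Vector search always returns the nearest neighbors it can find, each with a similarity score, even when nothing in the index is actually relevant. A result whose score falls below a reasonable threshold is a "forced match." Set a score_threshold (0.5-0.6) and state explicitly in the prompt: "If the retrieved information is insufficient to answer, say you don't know."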
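
A minimal sketch of both safeguards together, assuming the retriever returns `(text, score)` pairs where a higher score means more similar (some stores return a distance instead, in which case the comparison flips). The function names and the sample hits are hypothetical.

```python
def filter_by_score(results, score_threshold=0.55):
    """Keep (text, score) pairs at or above the threshold, best first."""
    kept = [(text, score) for text, score in results if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def build_prompt(question, results, score_threshold=0.55):
    kept = filter_by_score(results, score_threshold)
    # Fall back to an explicit marker so the model is not handed an empty context
    context = "\n\n".join(text for text, _ in kept) or "(no relevant context found)"
    return (
        "Answer using only the context below. "
        "If the retrieved information is insufficient to answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up retrieval hits for illustration
hits = [("Refunds are processed within 5 days.", 0.82), ("Office opening hours.", 0.31)]
prompt = build_prompt("How long do refunds take?", hits)
```

The low-scoring hit is dropped before it reaches the prompt, and when nothing passes the threshold the model is told so explicitly.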

Q: How to handle charts in documents?

A: 1) Use GPT-4V to convert charts to text descriptions before storing; 2) Generate natural language descriptions for charts separately (e.g., "Figure 3 shows quarterly sales for 2024, with Q3 highest at 12 million").

Related Resources

Dify Knowledge Base Setup: aiskillnav.com/tutorials/dify-enterprise-knowledge-base


Vector Database Selection: aiskillnav.com/tutorials/vector-database-comparison-pinecone-weaviate-chroma-2026
AI Agent Tool Directory: aiskillnav.com/agents