RAG Knowledge Base Pitfall Guide: Full Analysis of Chunking Strategies, Embedding Models, and Retrieval Tuning
A systematic approach to fixing RAG's 'irrelevant answers', 'missing information', and 'hallucinations'
Direct Answer
Top 3 reasons for poor RAG performance (ranked by impact):
Complete RAG System Architecture
[Document Preprocessing] → Cleaning → Chunking → Metadata Annotation
↓
[Vectorization] → Embedding Model → Vector Store (Chroma/Qdrant/Weaviate)
↓
[Retrieval] → Vector Search + BM25 → Reranker Re-ranking
↓
[Generation] → Context Assembly → LLM → Final Answer
Every step can introduce errors, requiring layer-by-layer optimization.
Layer 1: Document Preprocessing
Most overlooked pitfall in RAG: 60% of retrieval failures happen during document preprocessing, not in the retrieval algorithm.
python
import re

def clean_document(text: str) -> str:
    lines = text.split("\n")
    # Filter out short lines like page numbers, repeated headers (10 chars or fewer)
    lines = [l for l in lines if len(l.strip()) > 10 or l.strip() == ""]
    text = "\n".join(lines)
    # Merge sentences broken by line breaks during PDF extraction
    text = re.sub(r"(?
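The listing above is cut off at the final `re.sub` call. A minimal sketch of the complete cleaning step might look like the following. The merge regex is an assumption, not the original pattern: it joins a line break when the preceding character is not sentence-ending punctuation and the next character is lowercase. The function name `clean_document_sketch` is hypothetical.

```python
import re

def clean_document_sketch(text: str) -> str:
    """Hypothetical completion of the cleaning step; the merge regex is an assumption."""
    lines = text.split("\n")
    # Drop short non-empty lines (10 characters or fewer), e.g. page numbers or headers.
    lines = [l for l in lines if len(l.strip()) > 10 or l.strip() == ""]
    text = "\n".join(lines)
    # Assumed rule: replace a line break with a space when it falls mid-sentence,
    # i.e. the previous character is not . ! ? : or a newline, and the next
    # character is a lowercase letter.
    text = re.sub(r"(?<![.!?:\n])\n(?=[a-z])", " ", text)
    return text

raw = "The refund policy applies to all\norders placed in 2024.\n12\nSecond paragraph starts here."
cleaned = clean_document_sketch(raw)
```

Note that the length filter also removes short legitimate lines (for example a one-word answer), so the cutoff of 10 characters is worth checking against a sample of your own documents.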
Layer 2: Chunking Strategy
Semantic chunking example (best performance):
python
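from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split where the embedding distance between adjacent sentences
# exceeds the 90th percentile of all such distances
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
chunks = splitter.create_documents([document])  # document: the cleaned text (str)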
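To make the percentile breakpoint idea concrete, here is a simplified sketch of the technique in plain Python. It is not the SemanticChunker implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the sample sentences are invented for illustration.

```python
import math
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: a bag-of-words vector."""
    return Counter(sentence.lower().split())

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_chunks(sentences: list[str], pct: float = 90.0) -> list[list[str]]:
    if len(sentences) < 2:
        return [sentences] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # a jump larger than the threshold starts a new chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks

sentences = [
    "refunds are processed within five days",
    "refunds are processed by the finance team",
    "the office cafeteria opens at noon",
    "the cafeteria opens late on friday",
]
chunks_demo = semantic_chunks(sentences)
```

With these four sentences the only distance above the threshold is the jump from the refund topic to the cafeteria topic, so the text splits into two chunks of two sentences each.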
Layer 3: Embedding Model Selection
python
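from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
# normalize_embeddings=True returns unit-length vectors,
# so a plain dot product equals cosine similarity
embeddings = model.encode(texts, batch_size=32, normalize_embeddings=True)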
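The reason for passing `normalize_embeddings=True` can be checked with a few lines of arithmetic: once vectors have unit length, the dot product and cosine similarity are the same number, which lets the vector store use the cheaper inner-product metric. The sketch below uses two made-up 2-dimensional vectors rather than real model output.

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Illustrative vectors, not real embeddings.
a, b = [3.0, 4.0], [4.0, 3.0]
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
cosine_of_raw = cosine(a, b)
```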
Layer 4: Hybrid Retrieval
Why isn't pure vector retrieval enough?
User asks: "What was the refund amount for Q3 2024?"
Pure vector retrieval → Finds "Refund Policy Description" (semantically similar) but no specific numbers ❌
BM25 retrieval → Exact match "Q3 2024" and "refund amount" ✅
Hybrid retrieval → Vector finds relevant section + BM25 pinpoints the number ✅✅
python
from langchain.retrievers import EnsembleRetriever

# Hybrid retrieval: vector 60% + BM25 40%
# (vector_retriever and bm25_retriever are assumed to be built already)
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],
)
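It helps to know what the weights act on. To the best of our understanding, EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF) rather than averaging raw scores, which sidesteps the fact that BM25 scores and cosine similarities live on different scales. The sketch below illustrates that idea and is not the library's code; the constant `c=60` is a commonly used default and the document IDs are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs; each list contributes weight / (rank + c)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from each retriever, best match first.
vector_hits = ["policy", "q3_report", "faq"]
bm25_hits = ["q3_report", "invoice", "policy"]
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
```

Here `q3_report` ends up first because it ranks well in both lists, even though the higher-weighted vector retriever put `policy` on top.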
Layer 5: Reranker Re-ranking (Single Most Important Optimization)
Improves accuracy by 15-25%: first recall 20 candidates, then let the reranker select the top 4.
python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query, docs, top_n=4):
    # Score each (query, doc) pair jointly with the cross-encoder
    pairs = [[query, doc] for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score only, highest first
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

candidates = ensemble.invoke(query)  # recall about 20 candidates (set k on the underlying retrievers)
top_docs = rerank(query, [c.page_content for c in candidates])  # select top 4
System Evaluation: RAGAS Framework
python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

result = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# Each metric ranges from 0 to 1: 0.8+ is acceptable, 0.9+ is excellent
print(result)
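The score bands can be applied mechanically once you have per-metric numbers. The helper below is a small sketch in plain Python and does not depend on RAGAS; the function name, the "needs work" label for scores under 0.8, and the example scores are all assumptions made for illustration.

```python
def grade_scores(scores: dict[str, float]) -> dict[str, str]:
    """Map each 0-1 metric to a band: 0.9+ excellent, 0.8+ acceptable, else needs work."""
    def grade(value: float) -> str:
        if value >= 0.9:
            return "excellent"
        if value >= 0.8:
            return "acceptable"
        return "needs work"  # assumed label for anything below the 0.8 bar
    return {name: grade(value) for name, value in scores.items()}

# Made-up scores, not results from a real evaluation run.
example_scores = {"faithfulness": 0.93, "context_recall": 0.85, "answer_relevancy": 0.72}
grades = grade_scores(example_scores)
```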
FAQ
Q: When to use RAG vs. Fine-tuning?
A: Frequently updated knowledge → RAG; stable knowledge requiring specific style/format → Fine-tuning; both can be combined (fine-tuned model + RAG knowledge base).
Q: Why does vector search sometimes return irrelevant content?
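A: Vector search always returns the top-k nearest neighbors, even when nothing in the knowledge base is actually relevant; a result whose similarity score falls below a threshold is a "forced match." Set a score_threshold (0.5-0.6) to drop such results, and explicitly state in the prompt: "If the retrieved information is insufficient to answer, say you don't know."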
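A minimal sketch of both safeguards, assuming the retriever returns (text, score) pairs where a higher score means more similar (some vector stores return distances instead, in which case the comparison flips). The function names and sample hits are hypothetical.

```python
def filter_by_score(hits: list[tuple[str, float]], score_threshold: float = 0.5) -> list[str]:
    """Keep only hits whose similarity score reaches the threshold."""
    return [text for text, score in hits if score >= score_threshold]

def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that tells the model to decline when context is insufficient."""
    if contexts:
        context_block = "\n".join(f"- {c}" for c in contexts)
    else:
        context_block = "(no relevant context found)"
    return (
        "Answer using only the context below. "
        "If the retrieved information is insufficient to answer, say you don't know.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )

# Hypothetical retrieval results: (chunk text, similarity score).
hits = [("Refund summary for Q3 2024", 0.82), ("Office opening hours", 0.31)]
kept = filter_by_score(hits, score_threshold=0.5)
prompt = build_prompt("What was the refund amount for Q3 2024?", kept)
```

When every hit falls below the threshold, the prompt still goes out with an explicit "no relevant context" marker, so the model has a clear cue to decline instead of guessing.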
Q: How to handle charts in documents?
A: 1) Use GPT-4V to convert charts to text descriptions before storing; 2) Generate natural language descriptions for charts separately (e.g., "Figure 3 shows quarterly sales for 2024, with Q3 highest at 12 million").