Build a Production RAG System with LlamaIndex and Pinecone
Step-by-step guide to retrieval-augmented generation that works on real data
Most RAG tutorials only show the happy path. This guide builds a production-ready RAG system covering chunking strategies, embedding selection, reranking, evaluation, and edge case handling.
Architecture Overview
User Query -> Embed (text-embedding-3-small) -> Retrieve top-10 from Pinecone -> Rerank with Cohere (get top-3) -> Generate with GPT-4o-mini -> Return answer + sources
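The flow above can be sketched as a plain function with each stage passed in as a callable. The names `answer_query`, `retrieve`, `rerank`, and `generate` are hypothetical placeholders rather than LlamaIndex APIs; the sketch only shows the shape of the pipeline: fetch a wide candidate set, narrow it, then generate from what survives.

```python
from typing import Callable, Dict, List, Tuple

# A chunk is (text, source) here; real nodes carry more metadata.
Chunk = Tuple[str, str]


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[Chunk]],
    rerank: Callable[[str, List[Chunk], int], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
    retrieve_k: int = 10,
    rerank_n: int = 3,
) -> Dict[str, object]:
    """Retrieve a wide candidate set, narrow it, then generate an answer."""
    candidates = retrieve(query, retrieve_k)    # stage 1: vector search
    best = rerank(query, candidates, rerank_n)  # stage 2: rerank and truncate
    answer = generate(query, best)              # stage 3: LLM call
    return {"answer": answer, "sources": [source for _, source in best]}
```

Keeping the stages injectable makes it possible to test the wiring with stubs before any API key is involved.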
Installation
bash
pip install llama-index llama-index-vector-stores-pinecone \
llama-index-embeddings-openai llama-index-postprocessor-cohere-rerank \
pinecone-client pypdf
Step 1: Chunking Strategy
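Bad chunking breaks retrieval. The splitter below uses 512-token chunks with a 64-token overlap and treats blank lines as paragraph boundaries: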
python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=512,      # tokens per chunk
    chunk_overlap=64,    # tokens shared between neighboring chunks
    paragraph_separator='\n\n'
)
documents = SimpleDirectoryReader('./data').load_data()
nodes = parser.get_nodes_from_documents(documents)
print(f'Created {len(nodes)} chunks')
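To see what the overlap setting does, here is a simplified sketch of fixed-size windowing. It is not how `SentenceSplitter` works internally (that class also respects sentence and paragraph boundaries and counts tokens with a tokenizer); the hypothetical `chunk_with_overlap` helper below splits on whitespace words purely for illustration.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.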
Step 2: Pinecone Index
python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='your-key')
# Create the index only if it does not already exist
if 'docs-index' not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name='docs-index',
        dimension=1536,  # output size of text-embedding-3-small
        metric='cosine',
        spec=ServerlessSpec(cloud='aws', region='us-east-1')
    )
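The `metric='cosine'` setting means vectors are ranked by the angle between them rather than by magnitude. A brute-force sketch of that ranking follows; Pinecone uses approximate search at scale, so this is only meant to show what the score represents. The helper names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k ids most similar to the query, best first."""
    scored = [(vid, cosine(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```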
Step 3: Reranking (Biggest Quality Boost)
Vector similarity retrieves chunks that are semantically related to the query; a reranker then reorders those candidates by how directly each one answers it. This step improves accuracy by 20-35%.
python
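from llama_index.core import StorageContext, VectorStoreIndex, get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pc.Index('docs-index'))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)  # embeds and upserts the Step 1 nodes
retriever = index.as_retriever(similarity_top_k=10)  # top-10 candidates for the reranker
synthesizer = get_response_synthesizer()
reranker = CohereRerank(
    api_key='your-cohere-key',
    top_n=3,  # keep the 3 best chunks after reranking
    model='rerank-english-v3.0'
)
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_synthesizer=synthesizer
)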
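How much reranking helps depends on the corpus, so it is worth measuring on your own labeled queries. One common way is mean reciprocal rank (MRR), sketched below. The chunk ids and rankings are made-up illustration data, not measured results, and the helper names are hypothetical.

```python
from typing import List, Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """1/rank of the relevant chunk, or 0.0 if it was not returned."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], relevant: List[str]) -> float:
    """Average reciprocal rank over a set of labeled queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevant)) / len(runs)


# Hypothetical rankings for two labeled queries, before and after reranking.
relevant = ["c2", "c7"]
before = [["c1", "c5", "c2"], ["c7", "c3", "c4"]]
after = [["c2", "c1", "c5"], ["c7", "c4", "c3"]]
mrr_before = mean_reciprocal_rank(before, relevant)
mrr_after = mean_reciprocal_rank(after, relevant)
```

Comparing the two numbers on a few dozen real queries gives a grounded view of what the reranker buys for a particular dataset.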
Step 4: Handling Edge Cases
python
def safe_query(engine, query: str, min_score: float = 0.7):
    response = engine.query(query)
    nodes = response.source_nodes
    # Treat a missing score as 0.0 so the comparison never fails on None
    top_score = (nodes[0].score or 0.0) if nodes else 0.0
    if top_score < min_score:
        return {
            'answer': 'I do not have enough relevant information to answer confidently.',
            'sources': [],
            'confidence': 'low'
        }
    return {
        'answer': response.response,
        'sources': [n.metadata.get('file_name') for n in nodes],
        'confidence': 'high'
    }
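One more edge case in the sources list: `metadata.get('file_name')` returns `None` when the key is absent, and several chunks from the same file produce repeated names. A small order-preserving cleanup, shown here as a hypothetical helper `unique_sources`, is one way to handle both.

```python
from typing import Iterable, List, Optional


def unique_sources(file_names: Iterable[Optional[str]]) -> List[str]:
    """Drop missing names and duplicates while keeping first-seen order."""
    seen = dict.fromkeys(name for name in file_names if name)
    return list(seen)
```

It could be applied to the list built inside `safe_query` before the result is returned.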
Evaluation
Do not ship without evaluation:
python
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faithfulness_eval = FaithfulnessEvaluator()  # Is the answer grounded in the retrieved docs?
relevancy_eval = RelevancyEvaluator()  # Are the answer and retrieved docs relevant to the query?
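A single evaluation says little; what matters is the pass rate over a fixed question set. The sketch below aggregates per-metric pass rates with injected check functions. The `pass_rates` helper and the `Evaluator` signature are hypothetical. In practice each check would wrap one of the evaluators above and return its pass/fail verdict; this assumes the result object exposes a boolean `passing` field, which is worth confirming against the installed LlamaIndex version.

```python
from typing import Callable, Dict, List

# An evaluator takes (question, answer) and returns True when the check passes.
Evaluator = Callable[[str, str], bool]


def pass_rates(
    questions: List[str],
    answer_fn: Callable[[str], str],
    evaluators: Dict[str, Evaluator],
) -> Dict[str, float]:
    """Run every evaluator on every question and return the pass rate per metric."""
    passed = {name: 0 for name in evaluators}
    for question in questions:
        answer = answer_fn(question)
        for name, check in evaluators.items():
            passed[name] += int(check(question, answer))
    total = len(questions)
    return {name: (count / total if total else 0.0) for name, count in passed.items()}
```

Tracking these rates across changes to chunking or reranking settings shows whether a tweak helped or quietly regressed quality.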
Performance
The latency cost of reranking is worth it when accuracy matters.
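To judge that trade-off on your own workload, time the query path with and without the reranker. A minimal timing helper (the name `timed` is hypothetical) might look like this:

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `response, seconds = timed(query_engine.query, 'your question')` measures one end-to-end call; averaging over many queries gives a steadier number than a single run.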