Build a Production RAG System with LlamaIndex and Pinecone

Step-by-step guide to retrieval-augmented generation that works on real data

Advanced, about 20 minutes

Most RAG tutorials only show the happy path. This guide builds a production-ready RAG system covering chunking strategies, embedding selection, reranking, evaluation, and edge case handling.

Tags: rag, llamaindex, pinecone, vector database, retrieval augmented generation, llm

Production RAG System with LlamaIndex and Pinecone

Architecture Overview

User Query -> Embed (text-embedding-3-small) -> Retrieve top-10 from Pinecone -> Rerank with Cohere (get top-3) -> Generate with GPT-4o-mini -> Return answer + sources

Installation

bash
pip install llama-index llama-index-vector-stores-pinecone \
    llama-index-embeddings-openai llama-index-postprocessor-cohere-rerank \
    pinecone-client pypdf

Step 1: Chunking Strategy

Bad chunking breaks retrieval. Rules:

Legal/contracts: 256 tokens, high overlap

Technical docs: 512 tokens, semantic splitting

Tables and lists: keep intact, never split mid-table

python
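from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Split on sentence boundaries into chunks of up to 512 tokens, with 64 tokens of overlap
parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
    paragraph_separator='\n\n'
)

# Load every supported file under ./data, then split the documents into nodes (chunks)
documents = SimpleDirectoryReader('./data').load_data()
nodes = parser.get_nodes_from_documents(documents)
print(f'Created {len(nodes)} chunks')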
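To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window. It is not how SentenceSplitter works internally (that splitter respects sentence boundaries and counts real tokens). Whitespace-separated words stand in for tokens, and `window_chunks` is a hypothetical helper written only for this illustration.

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into fixed-size windows that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace words stand in for tokens here (an assumption for illustration)
words = [f"w{i}" for i in range(10)]
chunks = window_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so a larger overlap means more chunks and more repeated text in the index. That is the cost of the "high overlap" setting recommended for legal documents.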

Step 2: Pinecone Index

python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='your-key')

# Create the index only if it does not exist yet.
# dimension=1536 matches the output size of text-embedding-3-small.
if 'docs-index' not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name='docs-index',
        dimension=1536,
        metric='cosine',
        spec=ServerlessSpec(cloud='aws', region='us-east-1')
    )
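Setting metric='cosine' means vectors are ranked by the angle between them, not by their length. The following is a minimal pure-Python sketch of that ranking on tiny 3-dimensional toy vectors. `cosine` and `top_k` are hypothetical helpers, not Pinecone APIs, and the brute-force scan is for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k):
    """Return the k (id, score) pairs most similar to query, best first."""
    scored = [(vec_id, cosine(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors; the embeddings in this guide have 1536 dimensions
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [2.0, 0.1, 0.0],  # nearly the same direction as "a", larger magnitude
    "c": [0.0, 1.0, 0.0],  # orthogonal to "a"
}
results = top_k([1.0, 0.0, 0.0], vectors, k=2)
print(results)
```

The dimension check in `cosine` mirrors why the index dimension has to match the embedding model: a query vector of a different size cannot be compared against the stored vectors.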

Step 3: Reranking (Biggest Quality Boost)

Vector similarity retrieves semantically related chunks. Reranking finds the most answerable ones. This step improves accuracy by 20-35%.

python
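from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Keep only the 3 most relevant of the retrieved chunks
reranker = CohereRerank(
    api_key='your-cohere-key',
    top_n=3,
    model='rerank-english-v3.0'
)

# `retriever` and `synthesizer` are not defined in this guide; build them from your index first
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_synthesizer=synthesizer
)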
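The retrieve-then-rerank flow can be sketched without any external service. In the example below, first-stage vector scores are given as input. The second stage uses a simple word-overlap scorer as a stand-in for Cohere's reranker, which is a learned relevance model and far more capable. `overlap_score` and `retrieve_then_rerank` are hypothetical names used only here.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def retrieve_then_rerank(query, candidates, retrieve_k=10, top_n=3):
    """candidates: list of (text, vector_score). Returns the top_n texts after reranking."""
    # Stage 1: keep the retrieve_k candidates with the best vector score
    first_stage = sorted(candidates, key=lambda c: c[1], reverse=True)[:retrieve_k]
    # Stage 2: re-score only those candidates against the query text
    reranked = sorted(first_stage, key=lambda c: overlap_score(query, c[0]), reverse=True)
    return [text for text, _ in reranked[:top_n]]


candidates = [
    ("refund policy overview for enterprise plans", 0.91),
    ("how to request a refund within 30 days", 0.88),
    ("shipping times and carriers", 0.40),
    ("a refund is issued to the original payment method", 0.86),
]
best = retrieve_then_rerank("how to request a refund", candidates, retrieve_k=3, top_n=2)
print(best)
```

The chunk with the highest vector score is topically related but does not answer the question, so the second stage pushes it below the two chunks that do. This is the gap between "related" and "answerable" that reranking is meant to close.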

Step 4: Handling Edge Cases

python
def safe_query(engine, query: str, min_score: float = 0.7):
    """Return an answer only when the top source node scores at least min_score."""
    response = engine.query(query)
    nodes = response.source_nodes
    # A node's score can be None, so treat a missing score as low confidence
    top_score = nodes[0].score if nodes else None
    if top_score is None or top_score < min_score:
        return {
            'answer': 'I do not have enough relevant information to answer confidently.',
            'sources': [],
            'confidence': 'low'
        }
    return {
        'answer': response.response,
        'sources': [n.metadata.get('file_name') for n in nodes],
        'confidence': 'high'
    }
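Because safe_query only needs engine.query(...) to return an object with response and source_nodes, the threshold behavior can be checked without API keys. The sketch below repeats the function so it runs on its own and adds a guard for missing scores. `StubEngine` and the file name 'handbook.pdf' are hypothetical stand-ins.

```python
from types import SimpleNamespace


def safe_query(engine, query, min_score=0.7):
    # Threshold logic from the function above, with a guard for missing scores
    response = engine.query(query)
    nodes = response.source_nodes
    if not nodes or nodes[0].score is None or nodes[0].score < min_score:
        return {'answer': 'I do not have enough relevant information to answer confidently.',
                'sources': [], 'confidence': 'low'}
    return {'answer': response.response,
            'sources': [n.metadata.get('file_name') for n in nodes],
            'confidence': 'high'}


class StubEngine:
    """Hypothetical stand-in for a query engine; returns one source node with a fixed score."""

    def __init__(self, score):
        self.score = score

    def query(self, query):
        node = SimpleNamespace(score=self.score, metadata={'file_name': 'handbook.pdf'})
        return SimpleNamespace(response='Stub answer', source_nodes=[node])


high = safe_query(StubEngine(0.92), 'What is the refund window?')
low = safe_query(StubEngine(0.35), 'What is the refund window?')
print(high['confidence'], low['confidence'])
```

Note that the right min_score depends on which score reaches source_nodes. Reranker relevance scores and raw cosine similarities sit on different scales, so tune the threshold on your own queries.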

Evaluation

Do not ship without evaluation:

python
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faithfulness_eval = FaithfulnessEvaluator()  # Is the answer grounded in the retrieved docs?
relevancy_eval = RelevancyEvaluator()        # Are the answer and retrieved docs relevant to the query?
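Instantiating the evaluators is only the first step; they still have to be run over a set of test questions. A minimal sketch of such a loop follows. It assumes each evaluator exposes evaluate_response(query=..., response=...) and returns an object with a boolean passing attribute. This mirrors the LlamaIndex evaluator interface, but check it against your installed version. `pass_rate` and both stub classes are hypothetical and exist only so the loop runs without an LLM.

```python
from types import SimpleNamespace


def pass_rate(evaluator, engine, questions):
    """Fraction of questions whose response the evaluator marks as passing."""
    if not questions:
        return 0.0
    passed = 0
    for question in questions:
        response = engine.query(question)
        result = evaluator.evaluate_response(query=question, response=response)
        passed += bool(result.passing)
    return passed / len(questions)


class StubEngine:
    """Hypothetical engine that echoes the question back as the answer."""

    def query(self, question):
        return SimpleNamespace(response=f'Answer to: {question}')


class StubEvaluator:
    """Hypothetical evaluator: passes any response whose question mentions 'refund'."""

    def evaluate_response(self, query=None, response=None):
        return SimpleNamespace(passing='refund' in query)


questions = ['What is the refund window?', 'Who approves a refund?', 'Where is the office?']
rate = pass_rate(StubEvaluator(), StubEngine(), questions)
print(f'{rate:.2f}')
```

In a real run you would pass faithfulness_eval or relevancy_eval and your query engine instead of the stubs, and track the pass rate across changes to chunking, retrieval, and reranking settings.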

Performance

With reranking: 2-4 second query latency

Without reranking: 0.5-1 second

Accuracy improvement from reranking: +20-35%

The latency cost of reranking is worth it when accuracy matters.
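Latency figures like these depend on your data, network, and models, so it is worth measuring them on your own system. A minimal sketch using time.perf_counter follows. `measure_latency` and `percentile` are hypothetical helpers, and the sleeping lambda stands in for a call such as query_engine.query.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def measure_latency(fn, queries):
    """Call fn on each query and return p50 and p95 latency in seconds."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        fn(query)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return {'p50': percentile(timings, 50), 'p95': percentile(timings, 95)}


# Stand-in for query_engine.query so the sketch runs without API keys
stats = measure_latency(lambda q: time.sleep(0.01), ['q1', 'q2', 'q3', 'q4', 'q5'])
print(stats)
```

Running the same query set once with the reranker in node_postprocessors and once without gives a direct comparison for your own workload.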
