Build a Production RAG System with LlamaIndex and Pinecone

Step-by-step guide to retrieval-augmented generation that works on real data

Advanced · 20 minutes

Most RAG tutorials only show the happy path. This guide builds a production-ready RAG system covering chunking strategies, embedding selection, reranking, evaluation, and edge case handling.

Tags: rag, llamaindex, pinecone, vector database, retrieval augmented generation, llm

Architecture Overview

User Query -> Embed (text-embedding-3-small) -> Retrieve top-10 from Pinecone -> Rerank with Cohere (get top-3) -> Generate with GPT-4o-mini -> Return answer + sources

Installation

bash
pip install llama-index llama-index-vector-stores-pinecone \
    llama-index-embeddings-openai llama-index-postprocessor-cohere-rerank \
    pinecone-client pypdf

Step 1: Chunking Strategy

Bad chunking breaks retrieval. Rules:

  • Legal/contracts: 256 tokens, high overlap
  • Technical docs: 512 tokens, semantic splitting
  • Tables and lists: keep intact, never split mid-table
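
python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Split on sentence boundaries: 512-token chunks with 64 tokens of overlap
parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
    paragraph_separator='\n\n',
)

# Load every file under ./data and split it into nodes (chunks)
documents = SimpleDirectoryReader('./data').load_data()
nodes = parser.get_nodes_from_documents(documents)
print(f'Created {len(nodes)} chunks')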
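
To make the two splitter numbers concrete, here is a minimal sketch of a fixed-size sliding window. It is not how SentenceSplitter works internally (that respects sentence boundaries and counts model tokens). The helper name chunk_words is hypothetical, and whitespace-separated words stand in for tokens. It only shows that consecutive chunks start chunk_size - chunk_overlap units apart.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Illustrative only: words stand in for tokens, and sentence boundaries
    are ignored.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    stride = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten placeholder words, windows of 4 with 1 word shared between neighbours
sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk repeats the last word of the previous one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk when the overlap is large enough.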

Step 2: Pinecone Index

python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='your-key')

# Create the index only if it does not exist yet.
# dimension=1536 matches the output size of text-embedding-3-small.
if 'docs-index' not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name='docs-index',
        dimension=1536,
        metric='cosine',
        spec=ServerlessSpec(cloud='aws', region='us-east-1'),
    )
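
The metric='cosine' setting means vectors are ranked by the cosine of the angle between them, so vector length does not affect the score. A minimal pure-Python sketch of that score follows. It uses tiny 3-dimensional vectors rather than real 1536-dimensional embeddings, and the function name and sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, twice the length
orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

print(cosine_similarity(query_vec, same_direction))  # close to 1.0
print(cosine_similarity(query_vec, orthogonal))      # 0.0
```

The dimension check in the sketch mirrors why the index dimension has to match the embedding model: two vectors of different sizes cannot be compared at all.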

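Step 3: Reranking (Biggest Quality Boost)

Vector similarity retrieves chunks that are semantically related to the query. Reranking then picks the ones most likely to answer it. This step improves accuracy by 20-35%.

python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Keep only the 3 best of the retrieved candidates
reranker = CohereRerank(
    api_key='your-cohere-key',
    top_n=3,
    model='rerank-english-v3.0',
)

# `retriever` is assumed to come from the Pinecone-backed index,
# e.g. index.as_retriever(similarity_top_k=10)
synthesizer = get_response_synthesizer()

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
    response_synthesizer=synthesizer,
)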
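
The two-stage idea can be sketched without any API calls. In the toy below, a cheap first-stage score selects candidates and a second scorer reorders them. Everything here is an assumption for illustration: the function names, the sample chunks, the similarity scores, and especially word_overlap, which is a crude stand-in for a real reranking model such as Cohere's.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[tuple[str, float]],
    rerank_score: Callable[[str, str], float],
    top_k: int = 10,
    top_n: int = 3,
) -> list[str]:
    """Keep the top_k candidates by first-stage score, then reorder them
    with a second scorer and keep the best top_n."""
    first_stage = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    reranked = sorted(
        (text for text, _ in first_stage),
        key=lambda text: rerank_score(query, text),
        reverse=True,
    )
    return reranked[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Toy reranker: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


# (chunk text, hypothetical vector-similarity score)
candidates = [
    ("refunds are a common topic in support tickets", 0.91),
    ("the refund window is 30 days from delivery", 0.88),
    ("shipping takes five business days", 0.52),
]
top = retrieve_then_rerank(
    "how long is the refund window", candidates, word_overlap, top_k=3, top_n=1
)
print(top)
```

The first chunk has the highest similarity score because it is about the same topic, but the second chunk is the one that answers the question, and the second-stage scorer promotes it.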

Step 4: Handling Edge Cases

python
def safe_query(engine, query: str, min_score: float = 0.7):
    """Answer only when the top source node clears min_score."""
    response = engine.query(query)
    nodes = response.source_nodes
    # A node's score can be None, so guard before comparing
    top_score = nodes[0].score if nodes else None
    if top_score is None or top_score < min_score:
        return {
            'answer': 'I do not have enough relevant information to answer confidently.',
            'sources': [],
            'confidence': 'low'
        }
    return {
        'answer': response.response,
        'sources': [n.metadata.get('file_name') for n in nodes],
        'confidence': 'high'
    }
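
A fallback like this is easy to exercise without calling any API. The sketch below repeats the thresholding logic so it runs on its own and feeds it stub objects. FakeEngine and the SimpleNamespace stand-ins are hypothetical test doubles, not LlamaIndex classes. They only mimic the attributes the function reads (source_nodes, score, metadata, response).

```python
from types import SimpleNamespace

FALLBACK = 'I do not have enough relevant information to answer confidently.'


def safe_query(engine, query: str, min_score: float = 0.7):
    # Same thresholding idea as the guide's function, repeated so this runs alone
    response = engine.query(query)
    nodes = response.source_nodes
    top_score = nodes[0].score if nodes else None
    if top_score is None or top_score < min_score:
        return {'answer': FALLBACK, 'sources': [], 'confidence': 'low'}
    return {
        'answer': response.response,
        'sources': [n.metadata.get('file_name') for n in nodes],
        'confidence': 'high',
    }


class FakeEngine:
    """Test double that returns a canned response with one source node."""

    def __init__(self, score):
        self._score = score

    def query(self, query: str):
        node = SimpleNamespace(
            score=self._score, metadata={'file_name': 'policy.pdf'}
        )
        return SimpleNamespace(response='30 days.', source_nodes=[node])


question = 'What is the refund window?'
confident = safe_query(FakeEngine(score=0.92), question)
uncertain = safe_query(FakeEngine(score=0.31), question)
unscored = safe_query(FakeEngine(score=None), question)
print(confident['confidence'], uncertain['confidence'], unscored['confidence'])
```

Covering the low-score and missing-score cases in a test keeps the fallback from silently regressing when the retriever or reranker changes.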
    

Evaluation

Do not ship without evaluation:

python
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faithfulness_eval = FaithfulnessEvaluator()  # Is the answer grounded in the docs?
relevancy_eval = RelevancyEvaluator()  # Are the retrieved docs relevant?
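
Evaluator verdicts are most useful when aggregated over a fixed set of test questions. The sketch below assumes each result exposes a boolean passing attribute, as LlamaIndex evaluation results do. The stand-in objects, the sample verdicts, and the 0.9 release threshold are all made up for illustration.

```python
from types import SimpleNamespace


def pass_rate(results) -> float:
    """Fraction of evaluation results whose `passing` flag is True."""
    results = list(results)
    if not results:
        raise ValueError("no evaluation results to aggregate")
    return sum(1 for r in results if r.passing) / len(results)


# Stand-ins for evaluator outputs on a 4-question test set (made-up verdicts)
faithfulness_results = [
    SimpleNamespace(passing=p) for p in (True, True, True, False)
]
relevancy_results = [
    SimpleNamespace(passing=p) for p in (True, True, True, True)
]

MIN_PASS_RATE = 0.9  # hypothetical release gate; pick a value that fits your use case

report = {
    'faithfulness': pass_rate(faithfulness_results),
    'relevancy': pass_rate(relevancy_results),
}
ready_to_ship = all(rate >= MIN_PASS_RATE for rate in report.values())
print(report, ready_to_ship)
```

Tracking both rates separately helps tell a retrieval problem (low relevancy) from a generation problem (low faithfulness).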

Performance

  • With reranking: 2-4 second query latency
  • Without reranking: 0.5-1 second query latency
  • Accuracy improvement from reranking: +20-35%

The latency cost of reranking is worth it when accuracy matters.
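
Latency figures like these depend on your data, region, and models, so it is worth measuring them on your own setup. A minimal timing sketch follows. The names measure_latency and fake_query are hypothetical, and fake_query sleeps briefly in place of a real query engine call.

```python
import statistics
import time
from typing import Callable


def measure_latency(run_query: Callable[[str], object], queries: list[str]) -> dict:
    """Time each call with a monotonic clock and summarize in seconds."""
    if not queries:
        raise ValueError("need at least one query to time")
    timings = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        timings.append(time.perf_counter() - start)
    return {
        'count': len(timings),
        'median_s': statistics.median(timings),
        'max_s': max(timings),
    }


def fake_query(q: str) -> str:
    """Stand-in for a real query call; sleeps instead of hitting any API."""
    time.sleep(0.01)
    return 'answer'


stats = measure_latency(fake_query, ['q1', 'q2', 'q3'])
print(stats)
```

Running the same query set once with the reranker attached and once without gives a like-for-like comparison of the latency cost.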

Related tools

LlamaIndex, Pinecone, OpenAI, Cohere