Vector Databases & RAG in Production: Pinecone, Weaviate & pgvector in 2025

Build production-grade retrieval-augmented generation systems with vector search at scale

Advanced · 23 min read

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs with up-to-date knowledge. This guide covers vector database selection (Pinecone, Weaviate, Qdrant, pgvector), embedding model selection and optimization, chunking strategies for documents, hybrid search (vector + keyword), re-ranking, evaluating RAG quality, and deploying production RAG systems that stay accurate over time.

Tags: RAG, Vector Database, Pinecone, Weaviate, pgvector, LLM, Embeddings

Why RAG?

LLMs have a knowledge cutoff and cannot access proprietary data. RAG solves this by retrieving relevant documents at query time and including them in the prompt context. Benefits: no fine-tuning required for new knowledge, citations and source attribution, real-time knowledge updates, reduced hallucination for factual queries.

Vector Database Selection

Pinecone

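Fully managed, with serverless and pod-based deployment options. Designed for indexes of a billion or more vectors, with search latency typically under 100 ms depending on index type and filter selectivity. Supports metadata filtering alongside vector search. Reads are eventually consistent, so freshly upserted vectors can take a moment to appear in query results. Best for: teams that want zero infrastructure management.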

Weaviate

Open-source, self-hosted or managed cloud. Built-in multi-modal support (text + images). Hybrid search (BM25 + vector) out of the box. GraphQL and REST APIs. Best for: teams needing flexibility and multi-modal RAG.

Qdrant

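Open-source and written in Rust for high performance. Scalar quantization (float32 to int8) reduces vector memory by roughly 4x, and binary quantization can go further at some cost in accuracy. Named vectors allow multiple embeddings per document. Best for: high-performance on-premise deployments.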

pgvector

PostgreSQL extension—add vector search to existing Postgres. HNSW and IVFFlat indexes. Combine with full SQL queries. Best for: teams already on Postgres who want simple vector search without a new system.
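As a rough illustration of what "vector search inside Postgres" looks like, the sketch below holds the SQL a Python service might send. The table name `chunks`, its columns, and the 1536-dimension size are assumptions made for this example; the `vector` type, the `hnsw` index method, `vector_cosine_ops`, and the `<=>` cosine-distance operator come from pgvector itself.

```python
from typing import Sequence

# Assumed schema: the table and column names are illustrative only.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,
    content   text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
# The WHERE clause shows an ordinary SQL filter combined with vector ordering.
SEARCH_SQL = """
SELECT id, content
FROM chunks
WHERE source = %(source)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_pgvector_literal(values: Sequence[float]) -> str:
    """Format a vector as the text literal pgvector accepts, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

With a driver such as psycopg, the query would be run along the lines of `cur.execute(SEARCH_SQL, {"query": to_pgvector_literal(vec), "source": "handbook", "k": 5})`, where `vec` is the query embedding and `"handbook"` is a made-up source label.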

Comparison Matrix

| Database | Managed / ease of use | Scale | Features | Cost |
|----------|-----------------------|-------|----------|------|
| Pinecone | A+ | A+ | A | C |
| Weaviate | B+ | A | A+ | A (self-hosted) |
| Qdrant | B | A+ | A | A |
| pgvector | A (if on Postgres) | B | C | A |

Document Processing Pipeline

Chunking Strategies

Fixed-size chunking: split every N tokens (e.g., 512 tokens with 50-token overlap). Simple but ignores document structure.
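A minimal sketch of fixed-size chunking with overlap. It assumes the text has already been tokenized into a list; the whitespace split in the usage note is only a stand-in for a real tokenizer, and the function name is hypothetical.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

For example, `chunk_fixed("a b c d e f g h i j".split(), size=4, overlap=1)` yields three chunks, each starting with the last token of the one before it.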

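Semantic chunking: split on natural boundaries (paragraphs, sections, sentences) so that each chunk preserves a coherent unit of context. Options include LangChain's RecursiveCharacterTextSplitter (structure-aware) and SemanticChunker (embedding-based), or LlamaIndex's SentenceWindowNodeParser.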
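Splitting on natural boundaries can be sketched without any framework. The version below is an assumption-laden simplification: it treats blank lines as paragraph boundaries, counts whitespace-separated words instead of tokens, and packs whole paragraphs into chunks up to a word budget. It is not the LangChain or LlamaIndex implementation.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than `max_words` is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because paragraphs are never cut, chunk sizes vary; that is the trade-off against the predictable sizes of fixed-size chunking.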

Hierarchical chunking: store document summary + section summaries + detailed chunks. Small-to-big retrieval: find relevant small chunks, expand to larger parent context for the LLM.
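The small-to-big step can be sketched as a lookup from retrieved child chunks to their parent sections. The `Chunk` type, the `parents` mapping, and the function name are hypothetical; the sketch assumes the vector search has already returned child chunks in rank order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str  # id of the larger section this chunk was cut from
    text: str


def expand_to_parents(hits: list[Chunk], parents: dict[str, str], max_parents: int = 3) -> list[str]:
    """Map ranked small-chunk hits to their parent sections, deduplicated in rank order."""
    seen, contexts = set(), []
    for hit in hits:
        if hit.parent_id in seen or hit.parent_id not in parents:
            continue  # parent already included, or missing from the store
        seen.add(hit.parent_id)
        contexts.append(parents[hit.parent_id])
        if len(contexts) == max_parents:
            break
    return contexts
```

Several hits from the same section collapse into one parent, so the LLM sees each section once even when many of its small chunks matched.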

Embedding Model Selection

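OpenAI text-embedding-3-large: 3072 dimensions, the higher-quality of the two, $0.00013/1K tokens ($0.13 per 1M). text-embedding-3-small: 1536 dimensions, good quality, $0.00002/1K tokens ($0.02 per 1M). Both accept a dimensions parameter that shortens the output vector, trading some quality for lower storage and faster search.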
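To make the price and dimension figures concrete, here is a small back-of-the-envelope calculator. The prices are the per-1K-token figures quoted above; the float32 storage assumption (4 bytes per dimension, no index overhead) and the function names are this example's own.

```python
# Per-1K-token prices as quoted in the text above.
PRICE_PER_1K_TOKENS = {
    "text-embedding-3-large": 0.00013,
    "text-embedding-3-small": 0.00002,
}


def embedding_cost_usd(model: str, total_tokens: int) -> float:
    """One-off cost of embedding `total_tokens` tokens with `model`."""
    return PRICE_PER_1K_TOKENS[model] * total_tokens / 1000


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value
```

For a hypothetical corpus of 100,000 documents averaging 800 tokens (80M tokens in total), this gives about $10.40 for the large model and about $1.60 for the small one. One million 3072-dimensional float32 vectors occupy roughly 12.3 GB before index overhead, against about 6.1 GB at 1536 dimensions.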

Alternatives: Cohere Embed v3 (proprietary API; multilingual, strong for enterprise). Open-source models: E5-large-v2 (strong general-purpose retrieval), all-MiniLM-L6-v2 (fast, small, good for low-latency), BGE-M3 (multilingual, strong cross-lingual retrieval).

Choose based on: query language (multilingual needs multilingual model), latency requirements (larger models are slower), cost (open-source for high-volume use cases).

Hybrid Search

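Combining vector search (semantic similarity) with keyword search (BM25/TF-IDF) usually improves retrieval quality: embeddings capture paraphrases, while keyword matching catches exact identifiers, error codes and rare terms that embeddings tend to blur.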

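Reciprocal Rank Fusion (RRF): get the top-K results from vector search (ranked) and the top-K from keyword search (ranked), then combine them by rank rather than by raw score: score(d) = sum over each ranker of 1/(k + rank(d)), where rank(d) is the 1-based position of document d in that ranker's list and k is a smoothing constant, conventionally 60. A document missing from a ranker's list contributes nothing for that ranker. Sort by combined score. This typically outperforms either method alone, especially for queries with specific technical terms.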
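The RRF formula translates almost directly into code. This sketch assumes each ranker returns a list of document ids ordered best-first; the function name and the alphabetical tie-break are choices made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (doc_id, score), best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With a vector ranking of `["a", "b", "c"]` and a keyword ranking of `["c", "a", "d"]`, the fused order is a, c, b, d: documents found by both rankers rise above documents found by only one, regardless of how the two systems scale their raw scores.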

Re-ranking with Cross-Encoders

Two-stage retrieval: Stage 1 (fast): retrieve top-100 candidates with bi-encoder (vector search). Stage 2 (accurate): re-rank top-100 with cross-encoder (Cohere Rerank or BGE-reranker). Return top-5 re-ranked results to LLM.

Cross-encoders are slower because they run the model on each query-document pair jointly and cannot precompute document embeddings, but they are more accurate than bi-encoders. Use them for final ranking, not initial retrieval.
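The two-stage shape can be sketched independently of any particular model. Here `fast_score` stands in for bi-encoder similarity and `slow_score` for a cross-encoder; both are hypothetical callables supplied by the caller, not a specific library's API.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, doc_id) -> relevance score


def two_stage_search(
    query: str,
    doc_ids: list[str],
    fast_score: Scorer,
    slow_score: Scorer,
    n_candidates: int = 100,
    n_final: int = 5,
) -> list[str]:
    """Retrieve broadly with a cheap scorer, then re-rank only the candidates with an expensive one."""
    # Stage 1: cheap scoring over everything, keep the top candidates.
    candidates = sorted(doc_ids, key=lambda d: fast_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: expensive scoring runs on at most n_candidates documents.
    reranked = sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)
    return reranked[:n_final]
```

In a real system stage 1 is an approximate nearest-neighbour query rather than a full sort. The sketch also shows the limit of re-ranking: a document the first stage misses never reaches the second, so `n_candidates` bounds recall.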

RAG Evaluation

RAGAS framework evaluates: faithfulness (is the answer grounded in retrieved context?), answer relevancy (does the answer address the question?), context precision (are retrieved chunks relevant?), context recall (were all relevant chunks retrieved?).

Run RAGAS on a test set of 100-500 question-answer pairs. Reasonable starting targets: faithfulness > 0.90, context precision > 0.75; calibrate both against a baseline measured on your own data.
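To show what the two context metrics measure, here is a simplified, label-based sketch. It assumes relevance labels are already available for each retrieved chunk, whereas RAGAS derives such judgments with an LLM; the function names are hypothetical and are not the ragas package API. Context precision is computed as a rank-weighted average, in the spirit of the RAGAS metric, so relevant chunks near the top count for more.

```python
def context_precision(relevance: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


def context_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Fraction of the known-relevant chunks that were actually retrieved."""
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)
```

A retrieval that returns relevant, irrelevant, relevant scores (1/1 + 2/3) / 2, about 0.83. Moving the irrelevant chunk to the end raises it to 1.0, which is the ordering effect that re-ranking aims for.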

LLM-as-judge: use GPT-4 or Claude to evaluate answer quality on a 1-5 scale. Automated evaluation at scale for regression testing.
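A minimal sketch of an LLM-as-judge harness. The prompt wording, the `SCORE: <n>` reply convention, and the `call_llm` callable are all assumptions of this example; `call_llm` stands for whatever client function sends a prompt to the judge model and returns its text.

```python
import re
from typing import Callable, Optional

JUDGE_PROMPT = (
    "You are grading a RAG answer.\n"
    "Question: {question}\n"
    "Retrieved context: {context}\n"
    "Answer: {answer}\n"
    "Rate how well the answer is supported by the context and addresses the "
    "question, from 1 (poor) to 5 (excellent).\n"
    "Reply with brief reasoning, then a final line of the form 'SCORE: <n>'."
)


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the reply does not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, context: str, answer: str, call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a score; returns None for unparseable replies."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    return parse_score(call_llm(prompt))
```

Returning `None` instead of guessing keeps malformed judge replies visible in regression runs, where they can be counted and retried rather than silently scored.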

Production RAG Architecture

Query → Rewrite (HyDE or query expansion) → Retrieve (hybrid search, top-100) → Re-rank (top-5) → Generate (LLM with context) → Post-process (citation extraction, formatting) → Response.
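The pipeline above can be sketched as a chain of injected stages, which keeps each one replaceable and testable in isolation. Every name here (`RagPipeline`, the stage signatures, the returned dictionary keys) is hypothetical, and real stages would call a vector database, a re-ranker and an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RagPipeline:
    rewrite: Callable[[str], str]                     # e.g. HyDE or query expansion
    retrieve: Callable[[str, int], list[str]]         # hybrid search, returns candidates
    rerank: Callable[[str, list[str], int], list[str]]
    generate: Callable[[str, list[str]], str]         # LLM call with context

    def answer(self, query: str, n_retrieve: int = 100, n_context: int = 5) -> dict:
        rewritten = self.rewrite(query)
        candidates = self.retrieve(rewritten, n_retrieve)
        # Re-rank and generate against the user's original wording, not the rewrite.
        context = self.rerank(query, candidates, n_context)
        text = self.generate(query, context)
        # Post-processing is reduced here to returning the sources for citation.
        return {"answer": text, "sources": context}
```

Using the rewritten query only for retrieval is one design choice among several; it keeps an over-eager rewrite from changing what the final answer is judged against.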

Caching: cache embeddings for repeated queries, cache LLM responses for identical query+context pairs.
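An embedding cache can be as simple as a dictionary keyed by a hash of the model name and the query text. The sketch below is in-memory only and assumes that collapsing whitespace is a safe normalization for the embedding model in use; the class name and the `embed` callable are hypothetical.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache of query embeddings, keyed by model name and normalized text."""

    def __init__(self, embed: Callable[[str], list[float]], model: str):
        self._embed = embed
        self._model = model
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying embedder was called

    def _key(self, text: str) -> str:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        return hashlib.sha256(f"{self._model}\n{normalized}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]
```

Including the model name in the key matters: switching embedding models without it would serve vectors from the old model. A production version would typically back this with Redis or a similar store and add expiry.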

Index freshness: schedule daily or real-time re-embedding for updated documents. Use document change detection to avoid re-embedding unchanged content.
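Change detection can be sketched by storing a content hash next to each indexed document and comparing it on every sync. The function names and the shape of the inputs (document id to text, document id to stored hash) are assumptions of this example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents to (re-)embed and which to delete from the index."""
    to_embed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))  # removed at the source
    return {"embed": to_embed, "delete": to_delete}
```

Unchanged documents appear in neither list, so a daily sync over a mostly static corpus pays embedding costs only for what actually changed.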

RAG systems degrade over time as the knowledge base becomes stale—implement automated freshness monitoring and update pipelines for production reliability.

Related tools

Pinecone, Weaviate, Qdrant, pgvector, LangChain, LlamaIndex