Semantic Search Implementation: Complete Developer Guide 2026

Master Semantic Search Implementation with practical examples and production patterns

Semantic search finds results by meaning rather than exact keywords. You convert text into embeddings (dense vectors), store them in a vector index, and at query time embed the query and retrieve the nearest vectors. It's the retrieval half of retrieval-augmented generation (RAG) and the reason "find docs about cancelling a subscription" matches a page titled "How to end your plan."

The pipeline

  • Chunk your documents into passages (a few hundred tokens each).
  • Embed each chunk with an embedding model.
  • Store vectors + metadata in a vector database.
  • Query: embed the user's text, retrieve the top-k nearest chunks.
  • (Optional) Rerank the top-k for precision before passing to an LLM.

    # Install dependencies (shell)
    pip install openai numpy

    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(texts):
        # One API call embeds the whole batch of strings
        r = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in r.data]

    docs = ["How to cancel your subscription", "Resetting your password", "Billing FAQ"]
    doc_vecs = embed(docs)

    def search(query, k=2):
        q = np.array(embed([query])[0])
        # Cosine similarity between the query and every stored document vector
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
        return sorted(zip(docs, sims), key=lambda x: -x[1])[:k]

    print(search("end my plan"))  # → "How to cancel your subscription" is expected to rank first

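The NumPy version is for illustration: it scores the query against every stored vector, so query time grows linearly with the number of chunks. In production, use a vector database so search stays fast at scale.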
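As a middle step between the list comprehension above and a real vector database, the same brute-force search can be vectorized. The sketch below is only an illustration under stated assumptions: `normalize` and `top_k` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real embeddings. It is still an exact linear scan, so it does not replace the approximate nearest-neighbor indexes a vector database provides.

```python
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    arr = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    return arr / np.clip(norms, 1e-12, None)  # guard against zero-length rows

def top_k(query_vec, doc_matrix, k=2):
    """Return (row index, score) pairs for the k rows most similar to query_vec."""
    q = normalize([query_vec])[0]
    scores = doc_matrix @ q            # one matrix-vector product scores every document
    order = np.argsort(-scores)[:k]    # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings (hypothetical values).
toy_matrix = normalize([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
results = top_k([0.9, 0.1, 0.0], toy_matrix, k=2)
print(results)
```

Normalizing the document matrix once at index time means each query costs a single matrix-vector product instead of a Python loop with two norm computations per document.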

Picking a vector store

  • Prototyping / local: Chroma or pgvector (see Chroma vs Qdrant and the pgvector guide).
  • Production / scale: Qdrant, Pinecone, or Weaviate (see Pinecone vs Weaviate).

Quality levers

  • Chunking matters more than the embedding model. Too large and retrieval is imprecise; too small and you lose context. Try 200–500 tokens with slight overlap.
  • Hybrid search (combine semantic + keyword/BM25) catches exact terms (codes, names) that pure vectors miss.
  • Reranking a generous top-k with a cross-encoder sharply improves precision before the LLM sees it.
  • Metadata filtering (date, source, language) narrows the candidate set first.
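
The chunking advice above (a few hundred tokens with slight overlap) can be sketched as a sliding window. This is a minimal illustration, not a definitive chunker: `chunk_tokens` and `chunk_text` are hypothetical names, and splitting on whitespace is an assumption that only approximates the token counts of a real embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=30):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

def chunk_text(text, chunk_size=300, overlap=30):
    # Assumption: whitespace-separated words approximate model tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]

# Small numbers so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_text(sample, chunk_size=4, overlap=1)
print(pieces)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.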

This whole flow is the retrieval stage of RAG. To build the full pipeline, see LangChain vs LlamaIndex for RAG and LlamaIndex Production-Grade RAG.

FAQ

Embeddings vs keyword search? Keyword matches exact terms; embeddings match meaning. Hybrid uses both.

Which embedding model? A current general-purpose model is fine to start; the bigger wins come from chunking and reranking.

How many results to retrieve? Retrieve a generous top-k (e.g. 20), rerank, then pass the best few to the LLM.
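
The "hybrid uses both" and "pass the best few" answers can be combined in one small sketch. Reciprocal rank fusion is one common way to merge a semantic ranking with a keyword ranking; it is an assumption here, not something this guide prescribes. The function name, the document ids, and the two input rankings are hypothetical, and `k=60` is a commonly used default constant.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids into one list.

    Each document earns 1 / (k + rank) for every list it appears in,
    with rank starting at 1, and documents are sorted by total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]

# Hypothetical results from a vector search and a keyword (BM25) search.
semantic_hits = ["cancel-plan", "billing-faq", "refund-policy"]
keyword_hits = ["error-E1042", "cancel-plan", "billing-faq"]

best = reciprocal_rank_fusion([semantic_hits, keyword_hits], top_n=3)
print(best)
```

Documents that both retrievers agree on rise to the top, while an exact-match hit such as an error code still survives even though the vector search missed it. Fusing on ranks rather than raw scores avoids having to put cosine similarities and BM25 scores on a common scale.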

Summary

Semantic search = embed, store, retrieve by nearest-neighbor. Get chunking right, add hybrid search and reranking for precision, and choose a vector store that matches your scale. It's the backbone of every RAG system.


*Last updated: June 2026. Verify embedding APIs against the OpenAI docs.*
