Building AI-Powered Search with Semantic Retrieval

Replace keyword search with intelligent semantic understanding

Advanced · 35 minutes

Learn to build semantic search systems using embeddings, vector databases, and re-ranking. Covers hybrid search combining BM25 with dense retrieval for production search applications.

Tags: semantic-search, embeddings, vector-database, retrieval, hybrid-search

AI-Powered Semantic Search

Why Semantic Search?

Traditional keyword search fails when:
  • Users phrase queries with different words than the documents use
  • Queries require understanding context
  • Synonyms and related concepts matter
  • Multi-language search is needed
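
To make the vocabulary-mismatch problem concrete, here is a minimal sketch of exact-term matching. The `keyword_overlap` helper and the sample strings are illustrative assumptions, not part of any library; real keyword engines such as BM25 add term weighting but still depend on shared terms.

```python
def keyword_overlap(query: str, document: str) -> int:
    """Count distinct query terms that appear verbatim in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

doc = "Guide to fixing your automobile engine"
print(keyword_overlap("automobile engine guide", doc))  # same wording: 3 shared terms
print(keyword_overlap("car motor repair", doc))         # same intent, different words: 0 shared terms
```

The second query means roughly the same thing as the first, yet a purely lexical matcher gives it no credit at all.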

Architecture Overview

  • Encode documents as dense vectors (embeddings)
  • Store in a vector database
  • At query time, encode the query
  • Find nearest neighbors by cosine similarity
  • Re-rank results with a cross-encoder
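
The retrieval steps above can be sketched without any external service. This is a toy illustration only: the three-dimensional vectors are hand-written stand-ins for real embeddings, the `cosine` and `nearest_neighbors` helpers are hypothetical names, and the re-ranking step is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional "embeddings"; a real model would produce these.
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0
    [0.0, 0.2, 0.9],  # doc 1
    [0.7, 0.6, 0.1],  # doc 2
]
query_vec = [1.0, 0.2, 0.0]

for idx, score in nearest_neighbors(query_vec, doc_vecs):
    print(idx, round(score, 3))
```

A vector database performs the same lookup, but typically with an approximate index so that it does not have to compare the query against every stored vector.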

Building with OpenAI Embeddings

    python
    import openai  # requires openai>=1.0; the API key is read from OPENAI_API_KEY
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def embed_texts(texts):
        # Return one embedding vector per input string
        response = openai.embeddings.create(
            model="text-embedding-3-large",
            input=texts,
        )
        return [r.embedding for r in response.data]

    # Index documents
    docs = ["Python is a programming language", "Machine learning uses data"]
    doc_embeddings = embed_texts(docs)

    # Search
    query_embedding = embed_texts(["coding with Python"])[0]
    similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
    # Indices of the best matches, highest similarity first
    top_idx = np.argsort(similarities)[::-1][:5]
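
For a larger corpus, it is usually safer to send the texts in batches than in one request. The sketch below is one possible way to do that. The `batched` and `embed_corpus` helpers, the `batch_size` default, and the `fake_embed` stand-in are all assumptions made for illustration; in practice you would pass the embed_texts function above as `embed_fn`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_corpus(texts, embed_fn, batch_size=100):
    """Embed `texts` in batches, preserving input order.

    `embed_fn` takes a list of strings and returns one vector per string.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder for demonstration only: maps each text to [its length].
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

corpus = [f"document {i}" for i in range(250)]
vectors = embed_corpus(corpus, fake_embed, batch_size=100)
print(len(vectors))
```

Passing the embedder in as a parameter keeps the batching logic testable without network access.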

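Hybrid Search with BM25 + Dense Retrieval

    python
    import numpy as np
    from rank_bm25 import BM25Okapi
    from pinecone import Pinecone

    def normalize(scores):
        # Assumed implementation: min-max scaling to [0, 1] so that BM25 and
        # dense scores are on a comparable scale
        scores = np.asarray(scores, dtype=float)
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

    class HybridSearch:
        def __init__(self, documents):
            self.docs = documents
            self.bm25 = BM25Okapi([d.split() for d in documents])
            self.pc = Pinecone(api_key="...")
            self.index = self.pc.Index("docs")
            self._index_documents()

        def _index_documents(self):
            # Assumed implementation: one vector per document, with the
            # document's list position as its ID
            vectors = embed_texts(self.docs)
            self.index.upsert(
                vectors=[{"id": str(i), "values": v} for i, v in enumerate(vectors)]
            )

        def search(self, query, alpha=0.5, top_k=10):
            # BM25 scores for every document
            bm25_scores = self.bm25.get_scores(query.split())
            # Dense retrieval scores; documents outside the top_k matches stay at 0
            query_vec = embed_texts([query])[0]
            dense_results = self.index.query(vector=query_vec, top_k=top_k)
            dense_scores = np.zeros(len(self.docs))
            for match in dense_results.matches:
                dense_scores[int(match.id)] = match.score
            # Combine scores: alpha weights BM25, (1 - alpha) weights dense retrieval
            combined = alpha * normalize(bm25_scores) + (1 - alpha) * normalize(dense_scores)
            return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:top_k]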
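
To see how `alpha` shifts the result order in the weighted combination above, here is a small worked example in plain Python. The scores are made up, and `min_max`, `fuse`, and `ranking` are hypothetical helper names; min-max scaling is one reasonable choice of normalization, not the only one.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, dense_scores, alpha):
    """Weighted sum: alpha weights BM25, (1 - alpha) weights dense scores."""
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    return [alpha * b + (1 - alpha) * d for b, d in zip(bm25_n, dense_n)]

def ranking(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Made-up scores for three documents.
bm25_scores = [8.0, 2.0, 0.0]      # doc 0 matches the query terms best
dense_scores = [0.30, 0.45, 0.80]  # doc 2 is semantically closest

print(ranking(fuse(bm25_scores, dense_scores, alpha=0.8)))  # keyword-heavy: [0, 1, 2]
print(ranking(fuse(bm25_scores, dense_scores, alpha=0.2)))  # semantic-heavy: [2, 1, 0]
```

With `alpha=0.8` the exact-term match wins; with `alpha=0.2` the semantically closest document moves to the top. A sensible value is best chosen by evaluating on your own queries.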

Re-ranking with Cross-Encoders

    python
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def rerank(query, candidates, top_k=5):
        # Score each (query, candidate) pair jointly with the cross-encoder
        pairs = [(query, c) for c in candidates]
        scores = reranker.predict(pairs)
        # Keep the top_k candidates, highest score first
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [c for c, _ in ranked[:top_k]]
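
Because a cross-encoder scores every (query, candidate) pair with a full model pass, it is normally applied only to the short list returned by the first-stage retriever. The sketch below shows that two-stage shape with a pluggable scorer so it runs without downloading a model. `rerank_with` and `toy_score` are hypothetical names, and the word-overlap scorer is only a stand-in for `reranker.predict`.

```python
def rerank_with(score_fn, query, candidates, top_k=5):
    """Order candidates by score_fn(query, candidate), highest first."""
    scored = [(c, score_fn(query, c)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in scored[:top_k]]

# Toy stand-in for a cross-encoder: fraction of query words found in the text.
def toy_score(query, text):
    query_words = query.lower().split()
    text_words = set(text.lower().split())
    return sum(w in text_words for w in query_words) / len(query_words)

# Pretend these came back from the first-stage retriever, in retrieval order.
candidates = [
    "Machine learning uses data",
    "Python is a programming language",
    "Learning Python for data analysis",
]
top = rerank_with(toy_score, "python data analysis", candidates, top_k=2)
print(top)
```

The candidate that the retriever ranked last moves to the front once each pair is scored directly, which is the effect re-ranking is meant to achieve.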

Production Considerations

  • Choose embedding dimensions based on accuracy vs. speed trade-off
  • Decide whether index updates run in batches or in real time
  • Monitor search quality with click-through rates
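
As one way to act on the click-through point above, the sketch below computes click-through rate per result position from a search log. The log shape, a list of `(position, clicked)` pairs, and the `ctr_by_position` helper are assumptions for illustration; a real logging pipeline will look different.

```python
from collections import defaultdict

def ctr_by_position(impressions):
    """Click-through rate per result position.

    `impressions` is an iterable of (position, clicked) pairs, one per
    result shown to a user.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        if was_clicked:
            clicked[position] += 1
    return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}

# Hypothetical log: four searches, three results shown for each.
log = [
    (1, True), (2, False), (3, False),
    (1, False), (2, True), (3, False),
    (1, True), (2, False), (3, False),
    (1, True), (2, False), (3, False),
]
print(ctr_by_position(log))
```

If clicks drift away from the top positions after a change to the index or ranking, that is a signal worth investigating, although clicks are a noisy proxy for relevance.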

Related Tools

pinecone, openai, weaviate, elasticsearch