LLM Text Deduplication: Practical Tutorial

Using AI embeddings to deduplicate large text datasets

Duplicates in text datasets — support tickets filed twice, the same product listed five ways, near-identical articles — break analytics, bloat RAG indexes, and skew training data. Exact-match dedup catches none of the interesting cases ("Can't log in to my account" vs "login isnt working for me"). The production answer is a three-stage funnel: cheap hashing → embedding similarity → LLM adjudication, using the expensive tool only where it's needed.

Why the funnel shape

Comparing all pairs is O(n²) — 1M records is 500 *billion* pairs; you cannot LLM (or even embed-compare) your way through that naively. Each stage cuts candidates by orders of magnitude:

  • Stage 1 — normalization + hashing (free): lowercase, strip punctuation/whitespace, hash. Catches trivial dupes instantly. MinHash/SimHash extends this to "mostly same words" pairs at scale — the classic pre-LLM technique, still the right first pass for web-scale corpora.
  • Stage 2 — embeddings + nearest neighbors (cheap): embed every record once, find near-neighbor pairs above a similarity threshold. This catches *semantic* duplicates regardless of wording.
  • Stage 3 — LLM verdicts on the gray zone (precise): high similarity ≠ duplicate ("iPhone 15 case, black" vs "iPhone 15 case, red" embed nearly identically but are different products). The LLM adjudicates exactly these.
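Stage 1 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper names `normalize` and `exact_dupe_groups` are made up for this example, it only strips ASCII punctuation, and it covers exact matches after normalization rather than MinHash/SimHash.

```python
import hashlib
import re
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation (curly quotes etc. are NOT covered).
_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()

def exact_dupe_groups(records: list[str]) -> list[list[int]]:
    """Group record indices whose normalized text hashes to the same digest."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = hashlib.sha1(normalize(rec).encode("utf-8")).hexdigest()
        groups[key].append(idx)
    # Only groups with more than one member are duplicates.
    return [idxs for idxs in groups.values() if len(idxs) > 1]
```

Calling `exact_dupe_groups(["Can't log in!", "cant  log in", "Login isn't working"])` groups the first two records and leaves the third for Stage 2, since it shares meaning but not words.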

Stage 2: embeddings + ANN

    python
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed_batch(texts: list[str]) -> np.ndarray:
        # One request per call; split very large lists into chunks before calling.
        resp = client.embeddings.create(model='text-embedding-3-small', input=texts)
        return np.array([d.embedding for d in resp.data])

    # For real scale, put vectors in pgvector/FAISS and query neighbors;
    # in-memory cosine works to ~50K records:
    emb = embed_batch(records)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors: dot product = cosine
    sims = emb @ emb.T
    # Upper triangle only (i < j), so each pair appears once and self-pairs are skipped.
    # 0.80 is the lower edge of the gray band: see calibration below.
    cand_pairs = [(i, j) for i, j in np.argwhere(np.triu(sims, k=1) > 0.80).tolist()]

    Threshold calibration is the whole game here: label ~200 pairs across similarity bands once, then pick two cutoffs — above ~0.95 auto-merge, below ~0.80 auto-distinct, and only the band between goes to Stage 3. That band is typically a few percent of pairs, which is what makes LLM adjudication affordable. (Store vectors in pgvector and this doubles as your search index.)
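One conservative way to turn the labeled pairs into two cutoffs is sketched below. It assumes a list of `(similarity, is_duplicate)` tuples from your hand labeling and tolerates zero errors on that sample; the function names `pick_cutoffs` and `route` are illustrative, and a real calibration might accept a small error rate to narrow the band.

```python
def pick_cutoffs(labeled: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return (lo, hi) from hand-labeled (similarity, is_duplicate) pairs.

    lo: lowest similarity at which a labeled duplicate was seen.
    hi: highest similarity at which a labeled distinct pair was seen.
    Assumes the sample contains at least one pair of each class.
    """
    dup_sims = [s for s, is_dup in labeled if is_dup]
    distinct_sims = [s for s, is_dup in labeled if not is_dup]
    return min(dup_sims), max(distinct_sims)

def route(sim: float, lo: float, hi: float) -> str:
    """Decide what happens to a candidate pair with the given similarity."""
    if sim > hi:
        return "auto-merge"      # no labeled distinct pair scored this high
    if sim < lo:
        return "auto-distinct"   # no labeled duplicate scored this low
    return "llm"                 # gray band: send to Stage 3

# Toy labels for illustration only, not real calibration data.
labeled = [(0.99, True), (0.96, True), (0.93, False), (0.90, True),
           (0.84, True), (0.82, False), (0.75, False), (0.60, False)]
lo, hi = pick_cutoffs(labeled)
```

With these toy labels the band runs from 0.84 to 0.93, so a pair at 0.88 goes to the LLM while one at 0.97 merges automatically. A single mislabeled outlier widens the band, which is the safe direction to fail in.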

    Stage 3: LLM adjudication

    python
    import json

    # Reuses `client` from the Stage 2 snippet.
    def judge(a: str, b: str) -> dict:
        # The "Different size/color/version/intent" line is the domain definition; adapt it.
        prompt = f'''Are these two records duplicates (same real-world thing), or distinct?
    Variations in wording/typos/format = duplicate. Different size/color/version/intent = distinct.
    A: {a}
    B: {b}
    JSON: {{"verdict": "duplicate"|"distinct", "reason": ""}}'''
        resp = client.chat.completions.create(
            model='gpt-4o-mini',
            response_format={'type': 'json_object'},
            messages=[{'role': 'user', 'content': prompt}],
        )
        return json.loads(resp.choices[0].message.content)
    

    Notes that matter in production:

  • Define "duplicate" for your domain in the prompt — the size/color/version line above is doing the real work; tickets, products, and articles each need their own definition of "same thing".
  • Batch the gray zone through the Batch API — adjudication is never urgent.
  • Cluster, don't just pair: union-find over confirmed-duplicate pairs gives you groups; keep the best exemplar per group (longest, most complete, or LLM-picked) and map the rest to it.
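The clustering step can be sketched with a small union-find. This assumes records are addressed by integer index and that `dup_pairs` holds the pairs confirmed as duplicates (auto-merged or judged by the LLM); `cluster` and `pick_exemplar` are hypothetical helper names, and "longest record" is just one of the exemplar rules mentioned above.

```python
from collections import defaultdict

def cluster(n: int, dup_pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Union-find over confirmed duplicate pairs; returns groups of size > 1."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in dup_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two groups

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return [members for members in groups.values() if len(members) > 1]

def pick_exemplar(group: list[int], records: list[str]) -> int:
    """Keep the longest record as the group's exemplar (one possible rule)."""
    return max(group, key=lambda i: len(records[i]))
```

Note that union-find makes duplication transitive: if A matches B and B matches C, all three land in one group even if A and C were never compared. Spot-check large groups for chains that drifted.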

Where this pipeline earns its keep

  • RAG index hygiene: duplicate chunks waste retrieval slots and make answers repetitive, so dedup before loading chunks into your store (see the semantic search guide).
  • Support/CRM data: ticket and account dedup with entity-aware judging ("J. Smith, Acme" vs "John Smith, Acme Corp") — same funnel, with enrichment normalizing fields first.
  • Training/eval data: near-dupes between train and eval sets inflate metrics — deduplicate *across* splits, not just within.
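For the train/eval case, a cross-split check might look like the sketch below. It assumes both splits are already embedded into NumPy arrays with one row per record; `leaked_eval_indices` is an illustrative name and the 0.95 default is a placeholder for your own calibrated auto-merge cutoff.

```python
import numpy as np

def leaked_eval_indices(train_emb: np.ndarray, eval_emb: np.ndarray,
                        threshold: float = 0.95) -> list[int]:
    """Return indices of eval records with a near-duplicate in the train split."""
    # Normalize rows so the dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    ev = eval_emb / np.linalg.norm(eval_emb, axis=1, keepdims=True)
    sims = ev @ train.T  # shape: (n_eval, n_train)
    # An eval record leaks if its closest train record is at or above the threshold.
    return np.flatnonzero(sims.max(axis=1) >= threshold).tolist()
```

Dropping the flagged eval records (rather than the train ones) keeps the training set intact while removing the inflated scores.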

Operational notes

  • Provenance: log every merge (pair, verdict, reason, model version) — merges are destructive-ish; you want an undo trail.
  • Incremental mode: new records only compare against existing cluster exemplars — keeps daily runs cheap.
  • Audit the auto-merge band quarterly: thresholds drift as your data distribution changes; re-run the 200-pair calibration when sources change.

FAQ

    Cost ballpark? Embeddings are fractions of a cent per thousand short records; LLM adjudication only touches the gray band. A 100K-record dedup typically lands in single-digit dollars — the design above is *why*.

    Can I skip embeddings and hash only? For format-identical data, maybe; you'll miss every paraphrase. The embedding stage is what finds "same meaning, different words".

    Cross-lingual duplicates? Multilingual embedding models cluster translations together — the same funnel works; just confirm your Stage-3 prompt states that translations count as duplicates (or not) for your use case.


    *Last updated: June 2026.*
