LLM Text Deduplication: Practical Tutorial

Using AI embeddings to deduplicate large text datasets

Duplicates in text datasets — support tickets filed twice, the same product listed five ways, near-identical articles — break analytics, bloat RAG indexes, and skew training data. Exact-match dedup catches none of the interesting cases ("Can't log in to my account" vs "login isnt working for me"). The production answer is a three-stage funnel: cheap hashing → embedding similarity → LLM adjudication, using the expensive tool only where it's needed.

Why the funnel shape

Comparing all pairs is O(n²) — 1M records is 500 *billion* pairs; you cannot LLM (or even embed-compare) your way through that naively. Each stage cuts candidates by orders of magnitude:

Stage 1 — normalization + hashing (free): lowercase, strip punctuation/whitespace, hash. Catches trivial dupes instantly. MinHash/SimHash extends this to "mostly same words" pairs at scale — the classic pre-LLM technique, still the right first pass for web-scale corpora.

Stage 2 — embeddings + nearest neighbors (cheap): embed every record once, find near-neighbor pairs above a similarity threshold. This catches *semantic* duplicates regardless of wording.

Stage 3 — LLM verdicts on the gray zone (precise): high similarity ≠ duplicate ("iPhone 15 case, black" vs "iPhone 15 case, red" embed nearly identically but are different products). The LLM adjudicates exactly these.
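As a rough illustration of Stage 1, the sketch below normalizes records, groups exact matches by hash, and estimates word-overlap with a small MinHash. The function names, the 64-hash signature size, and the word 3-gram shingles are assumptions chosen for illustration, not part of the funnel itself; at web scale a dedicated MinHash/LSH library would be the more likely choice.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)        # strip punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


def exact_groups(records: list[str]) -> list[list[int]]:
    """Group record indices whose normalized text hashes to the same key."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for i, rec in enumerate(records):
        key = hashlib.sha1(normalize(rec).encode("utf-8")).hexdigest()
        buckets[key].append(i)
    return [idxs for idxs in buckets.values() if len(idxs) > 1]


def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-grams of the normalized text (the whole text if it has k words or fewer)."""
    words = normalize(text).split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Pairs that share an exact key can be merged immediately; pairs with a high estimated Jaccard score are "mostly same words" candidates, and everything else moves on to Stage 2.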

Stage 2: embeddings + ANN

python
import numpy as np
from openai import OpenAI
client = OpenAI()
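def embed_batch(texts: list[str], chunk_size: int = 1000) -> np.ndarray:
    # The embeddings endpoint caps the number of inputs per request, so send the
    # corpus in chunks. chunk_size=1000 is an assumed safe value; check the
    # current limit for your provider.
    vectors = []
    for start in range(0, len(texts), chunk_size):
        resp = client.embeddings.create(
            model='text-embedding-3-small', input=texts[start:start + chunk_size])
        vectors.extend(d.embedding for d in resp.data)
    return np.array(vectors)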
# For real scale, put vectors in pgvector/FAISS and query neighbors;
# in-memory cosine works to ~50K records:
emb = embed_batch(records)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb @ emb.T
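# Vectorized scan of the upper triangle (i < j) instead of a Python loop over all pairs.
# 0.80 is the lower edge of the gray band; see calibration below.
ii, jj = np.where(np.triu(sims, k=1) > 0.80)
cand_pairs = list(zip(ii.tolist(), jj.tolist()))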

Threshold calibration is the whole game here: label ~200 pairs across similarity bands once, then pick two cutoffs — above ~0.95 auto-merge, below ~0.80 auto-distinct, and only the band between goes to Stage 3. That band is typically a few percent of pairs, which is what makes LLM adjudication affordable. (Store vectors in pgvector and this doubles as your search index.)
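One possible way to turn the labeled sample into the two cutoffs is sketched below. It assumes the labels arrive as `(similarity, is_duplicate)` tuples, and the targets (`merge_precision=0.98`, `max_missed_rate=0.02`) are hypothetical defaults rather than recommendations; the function names are illustrative.

```python
def pick_cutoffs(labeled, merge_precision=0.98, max_missed_rate=0.02):
    """Derive (lower, upper) similarity cutoffs from manually labeled pairs.

    labeled: list of (similarity, is_duplicate) tuples.
    upper:   lowest observed similarity s where pairs with sim >= s are
             duplicates at least `merge_precision` of the time.
    lower:   highest observed similarity s where at most `max_missed_rate`
             of the pairs with sim < s are duplicates.
    """
    sims = sorted({s for s, _ in labeled})

    upper = float("inf")            # default: nothing is auto-merged
    for s in sims:                  # ascending
        above = [dup for sim, dup in labeled if sim >= s]
        if sum(above) / len(above) >= merge_precision:
            upper = s
            break

    lower = float("-inf")           # default: nothing is auto-distinct
    for s in reversed(sims):        # descending
        below = [dup for sim, dup in labeled if sim < s]
        if not below or sum(below) / len(below) <= max_missed_rate:
            lower = s
            break

    return min(lower, upper), upper


def route(similarity, lower, upper):
    """Send a candidate pair to one of the three bands."""
    if similarity >= upper:
        return "auto_merge"
    if similarity < lower:
        return "auto_distinct"
    return "llm_review"             # the gray zone that goes to Stage 3
```

With only ~200 labels the estimates are noisy, so it is probably worth eyeballing the pairs nearest each cutoff before trusting the numbers.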

Stage 3: LLM adjudication

python
import json

# Reuses the `client` created in the Stage 2 snippet.
def judge(a: str, b: str) -> dict:
    prompt = f'''Are these two records duplicates (same real-world thing), or distinct?
Variations in wording/typos/format = duplicate. Different size/color/version/intent = distinct.
A: {a}
B: {b}
JSON: {{"verdict": "duplicate"|"distinct", "reason": ""}}'''
    resp = client.chat.completions.create(
        model='gpt-4o-mini',
        response_format={'type': 'json_object'},
        messages=[{'role': 'user', 'content': prompt}],
    )
    return json.loads(resp.choices[0].message.content)

Notes that matter in production:

Define "duplicate" for your domain in the prompt — the size/color/version line above is doing the real work; tickets, products, and articles each need their own definition of "same thing".

Batch the gray zone through the Batch API — adjudication is never urgent.

Cluster, don't just pair: union-find over confirmed-duplicate pairs gives you groups; keep the best exemplar per group (longest, most complete, or LLM-picked) and map the rest to it.
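The clustering note above can be sketched with a small union-find. This is a minimal version under stated assumptions: records are addressed by integer index, `duplicate_pairs` holds only pairs already confirmed as duplicates, and "longest text" stands in for the exemplar rule; the function names are illustrative.

```python
def cluster_duplicates(n_records, duplicate_pairs):
    """Group record indices into clusters using union-find over confirmed pairs."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # merge the two clusters

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def pick_exemplars(records, groups):
    """Map every record index to its cluster's exemplar (here: the longest text)."""
    mapping = {}
    for members in groups:
        exemplar = max(members, key=lambda i: len(records[i]))
        for i in members:
            mapping[i] = exemplar
    return mapping
```

Union-find merges transitively: if A matches B and B matches C, all three land in one group even though A and C were never judged directly. For large clusters it may be worth spot-checking a few such indirect pairs.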

Where this pipeline earns its keep

RAG index hygiene: duplicate chunks waste retrieval slots and make answers repetitive, so deduplicate before loading chunks into your vector store (see the semantic search guide).

Support/CRM data: ticket and account dedup with entity-aware judging ("J. Smith, Acme" vs "John Smith, Acme Corp") — same funnel, with enrichment normalizing fields first.

Training/eval data: near-dupes between train and eval sets inflate metrics — deduplicate *across* splits, not just within.
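For the train/eval case above, a cross-split check might look like the following sketch. It assumes embeddings are already computed and passed as plain lists of floats, uses brute-force cosine similarity (fine for small eval sets; an ANN index would replace it at scale), and borrows 0.95 from the auto-merge example as a placeholder threshold. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leaked_eval_items(train_vecs, eval_vecs, threshold=0.95):
    """Return (eval_index, train_index, similarity) for eval items whose
    nearest training item is at or above the threshold."""
    leaks = []
    for j, eval_vec in enumerate(eval_vecs):
        best_i, best_sim = max(
            ((i, cosine(train_vec, eval_vec)) for i, train_vec in enumerate(train_vecs)),
            key=lambda pair: pair[1],
            default=(-1, -1.0),     # empty training set: nothing can leak
        )
        if best_sim >= threshold:
            leaks.append((j, best_i, best_sim))
    return leaks
```

Flagged eval items can then be dropped, or sent through the Stage 3 judge if the similarity falls in the gray band.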

Operational notes

Provenance: log every merge (pair, verdict, reason, model version) — merges are destructive-ish; you want an undo trail.

Incremental mode: compare each new record only against existing cluster exemplars, not against every stored record, which keeps daily runs cheap.

Audit the auto-merge band quarterly: thresholds drift as your data distribution changes; re-run the 200-pair calibration when sources change.
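The provenance and incremental-mode notes above could be combined roughly as follows. This sketch assumes exemplar embeddings are kept in a dict keyed by cluster id, reuses the example cutoffs of 0.80 and 0.95 as defaults, and writes provenance as one JSON object per line; the field names and function names are hypothetical.

```python
import json
import math
from datetime import datetime, timezone


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_incremental(new_vec, exemplars, lower=0.80, upper=0.95):
    """Compare one new record against cluster exemplars only.

    exemplars: dict mapping cluster_id -> exemplar vector.
    Returns (decision, cluster_id, similarity).
    """
    best_id, best_sim = None, -1.0
    for cluster_id, vec in exemplars.items():
        sim = cosine(new_vec, vec)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    if best_sim >= upper:
        return "auto_merge", best_id, best_sim
    if best_sim >= lower:
        return "llm_review", best_id, best_sim   # gray zone: ask the judge
    return "new_cluster", None, best_sim         # becomes its own exemplar


def merge_log_line(record_id, cluster_id, similarity, verdict, reason, model):
    """Serialize one merge decision as a JSON line for the undo trail."""
    return json.dumps({
        "record_id": record_id,
        "cluster_id": cluster_id,
        "similarity": round(similarity, 4),
        "verdict": verdict,
        "reason": reason,
        "model": model,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }, ensure_ascii=False)
```

Appending each line to a log file or table before applying the merge means a bad threshold or prompt change can be traced and reversed later.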

FAQ

Cost ballpark? Embeddings are fractions of a cent per thousand short records; LLM adjudication only touches the gray band. A 100K-record dedup typically lands in single-digit dollars — the design above is *why*.

Can I skip embeddings and hash only? For format-identical data, maybe; you'll miss every paraphrase. The embedding stage is what finds "same meaning, different words".

Cross-lingual duplicates? Multilingual embedding models cluster translations together — the same funnel works; just confirm your Stage-3 prompt states that translations count as duplicates (or not) for your use case.
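To make the cost ballpark above easier to adapt, here is a small estimator. Every input in the usage example (token counts, gray-zone size, per-million-token prices) is a placeholder assumption, not a quoted price; substitute your provider's current rates and your own measurements.

```python
def estimate_cost(n_records, tokens_per_record, gray_pairs, tokens_per_judgment,
                  embed_price_per_mtok, llm_price_per_mtok):
    """Rough dollar estimate: embed every record once, judge only gray-zone pairs.

    Prices are expressed per million tokens.
    """
    embed_cost = n_records * tokens_per_record / 1_000_000 * embed_price_per_mtok
    llm_cost = gray_pairs * tokens_per_judgment / 1_000_000 * llm_price_per_mtok
    return {
        "embedding": embed_cost,
        "adjudication": llm_cost,
        "total": embed_cost + llm_cost,
    }


# Placeholder inputs for illustration only.
example = estimate_cost(
    n_records=100_000,
    tokens_per_record=30,        # assumed average length of a short record
    gray_pairs=20_000,           # assumed size of the gray band
    tokens_per_judgment=200,     # assumed prompt + response tokens per pair
    embed_price_per_mtok=0.02,   # assumed embedding price
    llm_price_per_mtok=0.30,     # assumed blended LLM price
)
```

Under these assumed inputs the adjudication term dominates, which is why shrinking the gray band through calibration tends to matter more than the choice of embedding model.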

*Last updated: June 2026.*
