LLM Text Deduplication: Practical Tutorial
Using AI embeddings to deduplicate large text datasets
Duplicates in text datasets — support tickets filed twice, the same product listed five ways, near-identical articles — break analytics, bloat RAG indexes, and skew training data. Exact-match dedup catches none of the interesting cases ("Can't log in to my account" vs "login isnt working for me"). The production answer is a three-stage funnel: cheap hashing → embedding similarity → LLM adjudication, using the expensive tool only where it's needed.
Why the funnel shape
Comparing all pairs is O(n²): 1M records means roughly 500 *billion* pairs, and you cannot LLM (or even embed-compare) your way through that naively. Each stage cuts the candidate set by orders of magnitude before handing it to the next, more expensive one.
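The first and cheapest stage of the funnel is hashing. Here is a minimal sketch, assuming Stage 1 means hashing a normalized form of each record so that format-only variants collapse before any embedding call. The `normalize` rules and the function names are illustrative assumptions, not a prescribed recipe:

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse format-only differences (case, punctuation, spacing)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # squeeze whitespace


def exact_groups(records: list[str]) -> dict[str, list[int]]:
    """Map each normalized-text hash to the indices of records sharing it."""
    groups: dict[str, list[int]] = defaultdict(list)
    for i, rec in enumerate(records):
        digest = hashlib.sha1(normalize(rec).encode("utf-8")).hexdigest()
        groups[digest].append(i)
    return groups


records = [
    "Can't log in to my account",
    "cant log in  to my ACCOUNT!",
    "login isnt working for me",
]
groups = exact_groups(records)
# Keep one representative per hash; only these move on to embedding.
survivors = [idxs[0] for idxs in groups.values()]
```

The first two records collapse to the same hash, while the paraphrase in the third passes through untouched, which is exactly the case the later stages exist for.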
Stage 2: embeddings + ANN
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> np.ndarray:
    # One API call per batch; very large lists need to be split across several calls.
    resp = client.embeddings.create(model='text-embedding-3-small', input=texts)
    return np.array([d.embedding for d in resp.data])

# For real scale, put vectors in pgvector/FAISS and query neighbors;
# in-memory cosine works to ~50K records:
emb = embed_batch(records)  # records: list[str] of the texts to deduplicate
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize so dot product = cosine
sims = emb @ emb.T
cand_pairs = [(i, j) for i in range(len(records)) for j in range(i+1, len(records))
              if sims[i, j] > 0.85]  # threshold: see calibration below
```
Threshold calibration is the whole game here: label ~200 pairs across similarity bands once, then pick two cutoffs — above ~0.95 auto-merge, below ~0.80 auto-distinct, and only the band between goes to Stage 3. That band is typically a few percent of pairs, which is what makes LLM adjudication affordable. (Store vectors in pgvector and this doubles as your search index.)
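The two-cutoff routing described above can be sketched as follows. The default cutoffs are the approximate values quoted in the text and should be replaced by ones derived from your own labeled sample. `route` and `pick_cutoffs` are hypothetical helpers introduced here for illustration, and the zero-error rule in `pick_cutoffs` is just one simple way to choose cutoffs:

```python
def route(sim: float, lo: float = 0.80, hi: float = 0.95) -> str:
    """Assign a pair to a band based on its cosine similarity."""
    if sim >= hi:
        return "auto_merge"
    if sim < lo:
        return "auto_distinct"
    return "llm"  # gray band: send to Stage 3


def pick_cutoffs(labeled: list[tuple[float, bool]]) -> tuple[float, float]:
    """Derive (lo, hi) from hand-labeled (similarity, is_duplicate) pairs.

    lo: lowest similarity of any labeled duplicate (nothing below it was one).
    hi: lowest duplicate similarity above every labeled distinct pair.
    Raises ValueError if the sample lacks the pairs needed to set a cutoff.
    """
    dup = [s for s, is_dup in labeled if is_dup]
    dis = [s for s, is_dup in labeled if not is_dup]
    lo = min(dup)
    top_distinct = max(dis)
    hi = min(s for s in dup if s > top_distinct)
    return lo, hi


# Tiny made-up sample; a real calibration set would be ~200 labeled pairs.
labeled = [
    (0.97, True), (0.96, True), (0.91, False), (0.88, True),
    (0.84, True), (0.79, False), (0.70, False),
]
lo, hi = pick_cutoffs(labeled)
bands = [route(s, lo, hi) for s, _ in labeled]
```

A zero-error rule like this is strict and sensitive to a single mislabeled pair; with a larger sample you may prefer to tolerate a small error rate at each cutoff instead.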
Stage 3: LLM adjudication
```python
import json

def judge(a: str, b: str) -> dict:
    # Uses the `client` created in the Stage 2 snippet.
    prompt = f'''Are these two records duplicates (same real-world thing), or distinct?
Variations in wording/typos/format = duplicate. Different size/color/version/intent = distinct.
A: {a}
B: {b}
JSON: {{"verdict": "duplicate"|"distinct", "reason": ""}}'''
    resp = client.chat.completions.create(
        model='gpt-4o-mini',
        response_format={'type': 'json_object'},
        messages=[{'role': 'user', 'content': prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```
Notes that matter in production:
Where this pipeline earns its keep
Operational notes
FAQ
Cost ballpark? Embeddings are fractions of a cent per thousand short records; LLM adjudication only touches the gray band. A 100K-record dedup typically lands in single-digit dollars — the design above is *why*.
Can I skip embeddings and hash only? For format-identical data, maybe; you'll miss every paraphrase. The embedding stage is what finds "same meaning, different words".
Cross-lingual duplicates? Multilingual embedding models cluster translations together — the same funnel works; just confirm your Stage-3 prompt states that translations count as duplicates (or not) for your use case.
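For the cost question above, a back-of-envelope estimator makes the funnel's effect concrete. Every number passed in below is a placeholder assumption, not a quoted price or a measured result; substitute current pricing, your own token counts, and the gray-band size you actually observe:

```python
def pairs(n: int) -> int:
    """Number of unordered pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


def estimate_cost(
    n_records: int,
    tokens_per_record: float,
    embed_price_per_mtok: float,   # price per 1M embedding tokens
    gray_pairs: int,               # pairs that reach Stage 3
    tokens_per_judgment: float,    # prompt + reply tokens per LLM call
    llm_price_per_mtok: float,     # blended price per 1M LLM tokens
) -> float:
    """Rough total: embed every record once, judge only the gray band."""
    embed = n_records * tokens_per_record * embed_price_per_mtok / 1e6
    llm = gray_pairs * tokens_per_judgment * llm_price_per_mtok / 1e6
    return embed + llm


naive_pairs = pairs(1_000_000)  # what an all-pairs comparison would face

# Hypothetical inputs for a 100K-record run; all six values are placeholders.
total = estimate_cost(100_000, 30, 0.02, 5_000, 200, 0.60)
```

The point of the exercise is the ratio: the LLM term scales with `gray_pairs`, not with `pairs(n)`, so shrinking the gray band is what keeps the bill small.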
*Last updated: June 2026.*