LLM for Data Enrichment: Practical Tutorial
Enriching sparse data records with AI-generated content
Data enrichment — filling the gaps in sparse records — used to mean buying third-party data or manual research. LLMs changed the economics: classify free-text fields, normalize messy values, generate descriptions, infer categories — at cents per thousand records. This tutorial builds a production enrichment pipeline and is explicit about the line between derivation (safe) and fabrication (the failure mode that poisons datasets).
The decision rule that keeps you safe
An LLM can reliably enrich what is derivable from the record itself plus general world knowledge:
"NY", "new york", "NYC" → New York
For genuinely external facts, enrichment needs retrieval (web search or your KB) feeding the model; generation alone will produce confident inventions.
The pipeline
```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()
sem = asyncio.Semaphore(10)  # cap concurrent API calls
TAXONOMY = ['electronics', 'home-garden', 'apparel', 'toys', 'sports', 'other']


async def enrich(record: dict) -> dict:
    # Doubled braces are literal braces inside the f-string.
    prompt = f'''Given this product record, return JSON:
{{"category": one of {TAXONOMY},
"attributes": {{extracted key-values actually present in the text}},
"normalized_brand": cleaned brand name or null,
"confidence": "high"|"medium"|"low"}}
Rules: only use information present in the record. If unsure, use null + low confidence.
Record: {json.dumps(record, ensure_ascii=False)}'''
    async with sem:
        resp = await client.chat.completions.create(
            model='gpt-4o-mini',  # enrichment is mini-tier work
            response_format={'type': 'json_object'},
            messages=[{'role': 'user', 'content': prompt}],
        )
    out = json.loads(resp.choices[0].message.content)
    out['_source'] = 'llm_v3'  # provenance: model + prompt version
    # Merge original fields with enriched fields; enriched keys win on conflict.
    return {**record, **out}


async def run(records):
    # Enrich all records concurrently; the semaphore bounds in-flight requests.
    return await asyncio.gather(*(enrich(r) for r in records))
```
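The pipeline code parses whatever JSON the model returns and trusts it. A minimal sketch of a post-parse check follows, assuming the four output fields requested in the prompt. `validate_enrichment` and `CONFIDENCE_LEVELS` are hypothetical names introduced for illustration, and this is not a full schema validator.

```python
TAXONOMY = ['electronics', 'home-garden', 'apparel', 'toys', 'sports', 'other']
CONFIDENCE_LEVELS = ('high', 'medium', 'low')  # assumed from the prompt text


def validate_enrichment(out: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is acceptable."""
    problems = []
    if out.get('category') not in TAXONOMY:
        problems.append(f"category not in taxonomy: {out.get('category')!r}")
    if out.get('confidence') not in CONFIDENCE_LEVELS:
        problems.append(f"bad confidence: {out.get('confidence')!r}")
    if not isinstance(out.get('attributes'), dict):
        problems.append('attributes must be an object')
    brand = out.get('normalized_brand')
    if brand is not None and not isinstance(brand, str):
        problems.append('normalized_brand must be a string or null')
    return problems


good = {'category': 'toys', 'attributes': {'color': 'red'},
        'normalized_brand': None, 'confidence': 'high'}
drifted = dict(good, category='Electronics')  # near-duplicate of a valid value

print(validate_enrichment(good))
print(validate_enrichment(drifted))
```

Records that fail the check can be retried or routed to review instead of being written back, so a drifted label never reaches the dataset.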
Engineering notes baked into the pipeline snippet:
category is an enum, not free text: free-text categories drift into near-duplicates ("Electronics", "electronic", "E-lectronics") and defeat the purpose. Schema-validate and reject out-of-vocabulary values (validation guide).
The three production disciplines
1. Provenance columns. Every enriched field carries _source (model+prompt version) so you can re-run, audit, or bulk-invalidate a bad batch. Enriched data without provenance becomes indistinguishable from ground truth — that's how datasets rot.
2. Sample-based QA, forever. Eyeball 100 random outputs before full runs (catches prompt bugs), then continuously audit ~1% with a second pass (different model or human). Track disagreement rate as your quality metric — when it jumps after a model/prompt change, you have a regression gate (eval workflow).
3. Idempotent re-runs. Key enrichment by (record_id, prompt_version): re-running only processes changed records or new prompt versions. This turns enrichment from a one-off script into maintainable infrastructure.
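The idempotency rule in item 3 can be sketched as a small bookkeeping layer. This assumes each record has an `id` field and uses an in-memory dict where a real pipeline would use a table. `content_hash`, `pending`, and `mark_done` are hypothetical helper names.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """Stable hash of the input fields, so edits to a record are detectable."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


def pending(records: list[dict], done: dict, prompt_version: str) -> list[dict]:
    """Select records that still need enrichment for this prompt version.

    `done` maps (record_id, prompt_version) -> content hash at enrichment time.
    A record is selected if it was never enriched under this version or if its
    content changed since then.
    """
    return [
        rec for rec in records
        if done.get((rec['id'], prompt_version)) != content_hash(rec)
    ]


def mark_done(record: dict, done: dict, prompt_version: str) -> None:
    """Record that this version of the record was enriched."""
    done[(record['id'], prompt_version)] = content_hash(record)


records = [{'id': 1, 'title': 'red kettle'}, {'id': 2, 'title': 'blue mug'}]
done = {}
mark_done(records[0], done, 'llm_v3')
print([r['id'] for r in pending(records, done, 'llm_v3')])  # only record 2 is left
print([r['id'] for r in pending(records, done, 'llm_v4')])  # new prompt version: both
```

Storing the content hash next to the key is one way to detect changed records; a last-modified timestamp would serve the same purpose if the source system provides a trustworthy one.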
Beyond classification: the embedding combo
Deduplication and entity resolution ("are these two records the same vendor?") work better with embeddings + LLM verification than with LLM alone — embed records, cluster near-neighbors cheaply, then have the LLM adjudicate only candidate pairs. That pattern is its own guide: LLM text deduplication.
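A rough sketch of that two-stage shape, under stated assumptions: the vectors are already computed by some embedding model, the pairwise scan is brute force (a real run over many records would use a nearest-neighbor index), and `same_entity` stands in for the LLM adjudication call. All function names here are hypothetical.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(vectors: dict, threshold: float = 0.9) -> list[tuple]:
    """Cheap stage: keep only pairs whose embeddings are close."""
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2)
        if cosine(vec_a, vec_b) >= threshold
    ]


def resolve(vectors: dict, same_entity, threshold: float = 0.9) -> list[tuple]:
    """Expensive stage: ask same_entity(id_a, id_b) only about candidate pairs."""
    return [pair for pair in candidate_pairs(vectors, threshold) if same_entity(*pair)]


# Toy 2-d vectors standing in for real embeddings.
vectors = {'acme-inc': [1.0, 0.0], 'acme-corp': [0.99, 0.1], 'zenith': [0.0, 1.0]}
calls = []


def stub_judge(id_a, id_b):
    """Placeholder for an LLM call; records which pairs were actually checked."""
    calls.append((id_a, id_b))
    return True


print(resolve(vectors, stub_judge))
print(len(calls))  # number of adjudication calls made
```

With three records there are three possible pairs, but only the one close pair reaches the judge, which is where the cost saving comes from.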
FAQ
How accurate is category classification? With a clear taxonomy and a few-shot prompt, mini-tier models commonly reach accuracy in the high 90s (percent) on clean e-commerce-style data, but *your* number requires your own eval set; build the 200-record golden set first.
PII concerns? Sending records that contain personal data to an API counts as processing under GDPR, so redaction and vendor terms apply (compliance guide).
LLM vs rules? Keep regex/rules for the deterministic 80% (formats, exact mappings); spend LLM calls on the messy remainder. Hybrid pipelines are cheaper and more debuggable than LLM-everything.
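One way to picture the hybrid split is a rules-first lookup with a fallback. This is a sketch only: the mapping reuses the "NY" example from earlier in this tutorial, `llm_fallback` is a hypothetical callable standing in for a model call, and the second return value is a source tag for provenance.

```python
# Exact mappings handled deterministically; keys are lowercased, trimmed input.
EXACT = {'ny': 'New York', 'new york': 'New York', 'nyc': 'New York'}


def normalize_location(raw: str, llm_fallback) -> tuple[str, str]:
    """Return (normalized_value, source), where source is 'rule' or 'llm'."""
    key = raw.strip().lower()
    if key in EXACT:
        return EXACT[key], 'rule'  # free, deterministic, easy to debug
    return llm_fallback(raw), 'llm'  # messy remainder goes to the model


def stub_fallback(raw: str) -> str:
    """Placeholder for an LLM call; always answers 'New York' in this demo."""
    return 'New York'


print(normalize_location(' NYC ', stub_fallback))
print(normalize_location('the big apple', stub_fallback))
```

Tagging each value with its source keeps rule output and model output distinguishable later, in line with the provenance discipline above.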
*Last updated: June 2026.*