LLM for Data Enrichment: Practical Tutorial

Enriching sparse data records with AI-generated content

Data enrichment — filling the gaps in sparse records — used to mean buying third-party data or manual research. LLMs changed the economics: classify free-text fields, normalize messy values, generate descriptions, infer categories — at cents per thousand records. This tutorial builds a production enrichment pipeline and is explicit about the line between derivation (safe) and fabrication (the failure mode that poisons datasets).

The decision rule that keeps you safe

An LLM can reliably enrich what is derivable from the record itself plus general world knowledge:

| ✅ Derivable (do it) | ❌ Fabrication risk (don't) |
| --- | --- |
| Categorize a product from its title/description | "Find" a company's revenue or employee count |
| Normalize "NY", "new york", "NYC" → New York | Guess a person's email/phone |
| Extract attributes mentioned in free text | Infer attributes *not present* ("probably enterprise tier") |
| Standardize job titles to a taxonomy | Assign seniority no signal supports |
| Translate/summarize existing fields | "Enrich" with facts needing a live lookup |

For genuinely external facts, enrichment needs retrieval (web search/your KB) feeding the model — generation alone will produce confident inventions.

The pipeline

```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()
sem = asyncio.Semaphore(10)  # cap the number of concurrent API calls

TAXONOMY = ['electronics', 'home-garden', 'apparel', 'toys', 'sports', 'other']

async def enrich(record: dict) -> dict:
    prompt = f'''Given this product record, return JSON:
{{"category": one of {TAXONOMY},
  "attributes": {{extracted key-values actually present in the text}},
  "normalized_brand": cleaned brand name or null,
  "confidence": "high"|"medium"|"low"}}
Rules: only use information present in the record. If unsure, use null + low confidence.
Record: {json.dumps(record, ensure_ascii=False)}'''
    async with sem:
        resp = await client.chat.completions.create(
            model='gpt-4o-mini',  # enrichment is mini-tier work
            response_format={'type': 'json_object'},
            messages=[{'role': 'user', 'content': prompt}],
        )
    out = json.loads(resp.choices[0].message.content)
    out['_source'] = 'llm_v3'  # provenance: see below
    return {**record, **out}  # merge enriched fields into the original record

async def run(records):
    # Enrich all records concurrently; the semaphore bounds in-flight requests.
    return await asyncio.gather(*(enrich(r) for r in records))
```

Engineering notes baked into that snippet:

  • Closed vocabularies: category is an enum, not free text — free-text categories drift into near-duplicates ("Electronics", "electronic", "E-lectronics") and defeat the purpose. Schema-validate and reject out-of-vocabulary values (validation guide).
  • Confidence + null over guessing — the prompt explicitly licenses uncertainty, which measurably reduces fabrication.
  • Async with a semaphore for thousands of records (sync vs async); for millions and no deadline, the Batch API at 50% off is the right tool.
  • Mini-tier models: enrichment tasks are exactly what budget tiers are for — benchmark before paying flagship prices.
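The closed-vocabulary and confidence rules above can be enforced in code before any enriched value is written back. The following is a minimal sketch, assuming the output keys requested by the prompt in this tutorial; `validate_enrichment` is a hypothetical helper name, not part of any library.

```python
# Allowed values mirror the prompt; lists are used so that unhashable
# model output (e.g. a list where a string was expected) fails cleanly.
TAXONOMY = ['electronics', 'home-garden', 'apparel', 'toys', 'sports', 'other']
CONFIDENCE = ['high', 'medium', 'low']

def validate_enrichment(out: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is acceptable."""
    problems = []
    if out.get('category') not in TAXONOMY:
        problems.append(f"category out of vocabulary: {out.get('category')!r}")
    if out.get('confidence') not in CONFIDENCE:
        problems.append(f"bad confidence: {out.get('confidence')!r}")
    if not isinstance(out.get('attributes'), dict):
        problems.append('attributes must be an object')
    brand = out.get('normalized_brand')
    if brand is not None and not isinstance(brand, str):
        problems.append('normalized_brand must be a string or null')
    return problems
```

Rejecting (or routing to review) any record with a non-empty problem list keeps near-duplicate categories such as "Electronics" out of the dataset.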
The three production disciplines

1. Provenance columns. Every enriched field carries _source (model + prompt version) so you can re-run, audit, or bulk-invalidate a bad batch. Enriched data without provenance becomes indistinguishable from ground truth, and that is how datasets rot.

2. Sample-based QA, forever. Eyeball 100 random outputs before full runs (catches prompt bugs), then continuously audit ~1% with a second pass (different model or human). Track the disagreement rate as your quality metric: a jump after a model or prompt change is your regression gate (eval workflow).

3. Idempotent re-runs. Key enrichment by (record_id, prompt_version): re-running only processes changed records or new prompt versions. This turns enrichment from a one-off script into maintainable infrastructure.
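One way to sketch the idempotent re-run key in code is shown below. It assumes each record carries an `id` field and adds a content hash so that edited records are picked up again; the `done` dict, `pending`, and `mark_done` are hypothetical stand-ins for whatever table or store the pipeline actually uses.

```python
import hashlib
import json

PROMPT_VERSION = 'llm_v3'  # same tag the pipeline writes into _source

def record_hash(record: dict) -> str:
    """Stable fingerprint of a record's content, so edits trigger re-enrichment."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

def pending(records: list[dict], done: dict) -> list[dict]:
    """Select records with no stored result for (record_id, prompt_version),
    or whose content changed since that result was stored.

    `done` maps (record_id, prompt_version) -> content hash (assumed store).
    """
    todo = []
    for rec in records:
        key = (rec['id'], PROMPT_VERSION)
        if done.get(key) != record_hash(rec):
            todo.append(rec)
    return todo

def mark_done(record: dict, done: dict) -> None:
    """Remember that this record was enriched under the current prompt version."""
    done[(record['id'], PROMPT_VERSION)] = record_hash(record)
```

Bumping `PROMPT_VERSION` makes every record pending again, while an unchanged version only re-processes records whose content hash differs.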

Beyond classification: the embedding combo

Deduplication and entity resolution ("are these two records the same vendor?") work better with embeddings + LLM verification than with an LLM alone: embed records, cluster near neighbors cheaply, then have the LLM adjudicate only the candidate pairs. That pattern is its own guide: LLM text deduplication.
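The candidate-generation half of that pattern can be sketched as follows, assuming embeddings have already been computed elsewhere. The vectors, the 0.9 threshold, and the `candidate_pairs` name are illustrative assumptions, and the brute-force pairwise loop is only reasonable for small sets; larger ones would typically use a nearest-neighbor index.

```python
import math
from itertools import combinations

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.9):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold.

    Only these pairs would be sent to the LLM for a same-entity verdict.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((id_a, id_b, sim))
    return pairs
```

The point of the split is cost: the similarity pass is cheap arithmetic, so LLM calls are spent only on the few pairs that are plausibly the same entity.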

FAQ

How accurate is category classification? With a clear taxonomy and a few-shot prompt, mini-tier models commonly hit high-90s accuracy on clean e-commerce-style data. *Your* number requires your own eval set; build the 200-record golden set first.
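Scoring against a golden set needs very little code. A minimal sketch, assuming both the golden labels and the model's predictions are available as dicts keyed by record ID; the `accuracy` helper is hypothetical.

```python
def accuracy(golden: dict[str, str], predicted: dict[str, str]) -> float:
    """Share of golden-set records whose predicted category matches the label.

    Records missing from `predicted` count as wrong.
    """
    if not golden:
        raise ValueError('golden set is empty')
    hits = sum(1 for rid, label in golden.items() if predicted.get(rid) == label)
    return hits / len(golden)
```

Running this after every model or prompt change gives a single number to compare across versions.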

PII concerns? Sending records with personal data to an API counts as processing under GDPR, so redaction and vendor terms apply (compliance guide).

LLM vs rules? Keep regex/rules for the deterministic 80% (formats, exact mappings); spend LLM calls on the messy remainder. Hybrid pipelines are cheaper and more debuggable than LLM-everything.
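A rules-first split might look like the sketch below. The alias table and the function names are illustrative assumptions; the idea is only that exact mappings resolve deterministically and everything else is queued for the LLM.

```python
# Exact mappings handled without a model call; entries are illustrative.
CITY_ALIASES = {'ny': 'New York', 'nyc': 'New York', 'new york': 'New York'}

def normalize_city(raw: str):
    """Return (value, source); value is None when the rules cannot resolve it."""
    key = raw.strip().lower()
    if key in CITY_ALIASES:
        return CITY_ALIASES[key], 'rule'
    return None, 'needs_llm'

def split_by_rules(values: list[str]):
    """Resolve what the rules can; return the rest for LLM enrichment."""
    resolved, remainder = {}, []
    for v in values:
        value, source = normalize_city(v)
        if source == 'rule':
            resolved[v] = value
        else:
            remainder.append(v)
    return resolved, remainder
```

Recording the `source` alongside each value also keeps rule-derived and model-derived fields distinguishable later.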


*Last updated: June 2026.*
