Named Entity Recognition: Complete Implementation

Building production NER systems with LLMs and spaCy

NER (pulling people, organizations, locations, dates, and amounts out of text) used to mean training a spaCy or BERT model per entity schema. LLMs changed the build: define your entities in a prompt and get structured extractions immediately, with no training data required. This guide implements production LLM-NER, covers when classic models still win, and lays out the evaluation you need either way.

LLM-NER vs classic NER: the real trade

| | LLM prompting | Classic (spaCy/fine-tuned BERT) |
|---|---|---|
| Setup | Minutes (a prompt) | Days+ (training data, training) |
| Custom entity types | Trivial: describe them | Each type needs labeled examples |
| Accuracy on standard types | Strong | Strong (and consistent) |
| Niche domain types ("drug dosage", "clause type") | Wins: zero-shot from description | Needs hundreds of labels |
| Cost at huge volume | Per-token, adds up | Near-zero after training |
| Latency | API-bound | Milliseconds local |
| Character offsets | Unreliable | Native and exact |

Rules of thumb: LLM for custom schemas, low/medium volume, fast iteration; classic for standard entities at massive scale or when you need exact character spans (highlighting, redaction overlays). The hybrid below covers the rest.

Production implementation

python
import json
from openai import OpenAI
client = OpenAI()
NER_PROMPT = '''Extract entities from the text. Return JSON:
{"entities": [{"text": "",
               "type": "PERSON" | "ORG" | "LOCATION" | "DATE" | "MONEY" | "PRODUCT",
               "context": "<5-word surrounding snippet>"}]}
Rules:
"text" must be copied EXACTLY from the input (same casing/spacing).
Only the listed types. No interpretation: extract "next Tuesday" as-is, don't resolve it.
Empty list if none. JSON only.
Text: {input}'''


def extract_entities(text: str) -> list[dict]:
    resp = client.chat.completions.create(
        model='gpt-4o-mini',                          # NER is mini-tier work
        response_format={'type': 'json_object'},
        messages=[{'role': 'user', 'content': NER_PROMPT.replace('{input}', text)}],
    )
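    data = json.loads(resp.choices[0].message.content)
    ents = data.get('entities', [])                    # tolerate a missing key
    # The verification step that makes LLM-NER trustworthy:
    verified = []
    for e in ents:
        span = e.get('text')
        if not isinstance(span, str) or not span:      # ''.find-style matches would "verify" at 0
            continue
        idx = text.find(span)                          # first occurrence only
        if idx >= 0:                                   # exact-match grounding
            verified.append({**e, 'start': idx, 'end': idx + len(span)})
    return verified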
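
The grounding loop above keeps one offset per entity. If you need every mention (for highlighting or redaction), a variant along these lines might help. It is a sketch, not part of the pipeline above: `ground_all` and `ALLOWED_TYPES` are hypothetical names, and the type set simply mirrors the prompt's list.

```python
ALLOWED_TYPES = {"PERSON", "ORG", "LOCATION", "DATE", "MONEY", "PRODUCT"}


def ground_all(text: str, ents: list[dict]) -> list[dict]:
    """Keep entities with an allowed type and record every exact occurrence."""
    grounded = []
    seen = set()
    for e in ents:
        span, etype = e.get("text"), e.get("type")
        if not isinstance(span, str) or not span:
            continue                      # missing or empty span
        if not isinstance(etype, str) or etype not in ALLOWED_TYPES:
            continue                      # type drifted outside the closed list
        if (span, etype) in seen:
            continue                      # model listed the same mention twice
        seen.add((span, etype))
        start = text.find(span)
        while start >= 0:                 # walk all non-overlapping matches
            end = start + len(span)
            grounded.append({"text": span, "type": etype, "start": start, "end": end})
            start = text.find(span, end)
    return grounded
```

Anything the model invented, mistyped, or returned empty is dropped; everything that survives carries offsets that slice back to the exact span.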

The design choices that matter:

Exact-substring rule + text.find() verification: this removes the hallucinated-entity class (anything not literally in the input gets dropped) *and* recovers character offsets, addressing LLM-NER's two classic weaknesses in one move. One caveat: text.find() returns only the first match, so a span that occurs several times always gets the first occurrence's offsets. (Same string-grounding trick as contract analysis and OCR validation.)

Closed type list: open-ended typing drifts ("Company", "ORG", "organization"), so enumerate the allowed types and validate every returned type against that list.

Mini-tier model + few-shot if needed: 3-5 examples of your domain's tricky cases in the prompt buys more accuracy than a model-tier upgrade.

Chunk long documents with overlap (so an entity that straddles a chunk boundary appears whole in at least one chunk), dedupe by (text, type) afterwards, and for volume runs, send the lot through the batch API.
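
The chunk-and-dedupe step can be sketched roughly as follows. The names `chunk_with_overlap` and `extract_long` are hypothetical, the character-based sizes are placeholder defaults rather than recommendations, and `extract` stands for any function shaped like `extract_entities` (returning dicts with `text`, `type`, `start`, `end`). The overlap should be at least as long as the longest entity you expect.

```python
from typing import Callable, Iterator


def chunk_with_overlap(text: str, size: int = 2000, overlap: int = 200) -> Iterator[tuple[int, str]]:
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    # Stop once the previous chunk already reached the end of the text.
    for offset in range(0, max(len(text) - overlap, 1), step):
        yield offset, text[offset:offset + size]


def extract_long(text: str, extract: Callable[[str], list[dict]],
                 size: int = 2000, overlap: int = 200) -> list[dict]:
    """Run `extract` per chunk, shift offsets to document space, dedupe by (text, type)."""
    merged: dict[tuple[str, str], dict] = {}
    for offset, chunk in chunk_with_overlap(text, size, overlap):
        for e in extract(chunk):
            key = (e["text"], e["type"])
            if key not in merged:  # keep the first mention's offsets
                merged[key] = {**e, "start": e["start"] + offset, "end": e["end"] + offset}
    return list(merged.values())
```

Deduping by (text, type) keeps one record per distinct entity, so later mentions lose their offsets; if you need every mention, dedupe on (text, type, start) instead.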

Beyond extraction: linking and normalization

Real pipelines rarely stop at spans:

Normalization ("next Tuesday" → date, "$2.3M" → 2300000): do it as a *second* pass over verified entities, where the model may interpret — keeping extraction literal and normalization separate makes errors debuggable.

Entity linking ("Apple" → the company vs the fruit; "J. Smith" = "John Smith"?): embedding similarity for candidates + LLM adjudication for the gray zone — the same funnel as deduplication.

Relation extraction ("PERSON works-for ORG") works in the same prompt shape once entity quality is proven — resist doing it before.

Evaluation (non-negotiable, 2 hours)

Label 100 representative documents (your real data, including the ugly ones). Score precision and recall per entity type, not overall: an aggregate hides "MONEY is 99%, PRODUCT is 60%". Re-run on every prompt or model change. Typical findings: standard types land high-90s precision after the verification step; custom types start lower and respond to few-shot examples.
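
Per-type scoring is a few lines once labels and predictions share a shape. The sketch below assumes strict matching on `(doc_id, text, type)` tuples, which is one reasonable convention among several; `per_type_scores` is a hypothetical name.

```python
from collections import Counter


def per_type_scores(gold, pred) -> dict[str, dict[str, float]]:
    """gold, pred: iterables of (doc_id, text, type). Returns precision/recall per type."""
    gold_set, pred_set = set(gold), set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in pred_set:
        # A prediction counts as correct only if doc, text and type all match.
        (tp if item in gold_set else fp)[item[2]] += 1
    for item in gold_set - pred_set:
        fn[item[2]] += 1
    scores = {}
    for t in sorted(set(tp) | set(fp) | set(fn)):
        p_den, r_den = tp[t] + fp[t], tp[t] + fn[t]
        scores[t] = {
            "precision": tp[t] / p_den if p_den else 0.0,
            "recall": tp[t] / r_den if r_den else 0.0,
        }
    return scores
```

Note that a right span with the wrong type counts twice under this convention: as a false positive for the predicted type and a false negative for the gold type.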

FAQ

GLiNER-style compact zero-shot NER models? A genuinely good middle option: zero-shot custom types like an LLM, runs local/cheap like classic — worth benchmarking when volume grows but schemas keep changing.

Multilingual? LLM-NER works across major languages in one prompt — verify the exact-substring rule survives tokenization quirks per language with your eval set.
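
One quirk that could break exact-substring matching, offered here as an assumption rather than something measured for this guide, is Unicode normalization: the input may store "é" as two code points while the model returns the single composed character. A minimal sketch of a normalization-aware lookup (`find_normalized` is a hypothetical helper):

```python
import unicodedata


def find_normalized(text: str, span: str, form: str = "NFC") -> int:
    """Like str.find, but compares both strings in the same Unicode normal form.

    The returned index refers to the normalized text, so normalize the
    document once up front and use that version everywhere downstream.
    """
    return unicodedata.normalize(form, text).find(unicodedata.normalize(form, span))


decomposed = "Cafe\u0301 Nord opened."   # 'e' followed by a combining acute accent
composed_span = "Caf\u00e9 Nord"         # single precomposed character

print(decomposed.find(composed_span))             # plain find misses the span
print(find_normalized(decomposed, composed_span)) # normalized find locates it
```

Your eval set will show whether this matters for your languages; if it does, normalizing the input before it ever reaches the prompt is the simpler fix.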

PII redaction use case? This pipeline + offsets = redaction; for compliance-grade PII specifically, pair with a dedicated detector (Presidio) — two independent systems disagree usefully (privacy patterns).
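
For the PII case, "disagree usefully" can be made concrete by comparing the two systems' spans and reviewing whatever only one of them found. The sketch below is detector-agnostic: it assumes you have already converted each system's output to `(start, end)` character offsets, and `disagreements` is a hypothetical helper, not part of Presidio.

```python
Span = tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open [start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def disagreements(llm_spans: list[Span], detector_spans: list[Span]) -> tuple[list[Span], list[Span]]:
    """Return (spans only the LLM found, spans only the detector found)."""
    only_llm = [s for s in llm_spans if not any(overlaps(s, d) for d in detector_spans)]
    only_detector = [d for d in detector_spans if not any(overlaps(d, s) for s in llm_spans)]
    return only_llm, only_detector
```

For redaction, a cautious policy is to redact the union of both systems' spans and send the disagreements to review.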

*Last updated: June 2026.*
