Named Entity Recognition: Complete Implementation
Building production NER systems with LLMs and spaCy
NER — pulling people, organizations, locations, dates, amounts out of text — used to mean training a spaCy/BERT model per entity schema. LLMs changed the build: define your entities in a prompt, get structured extractions immediately, no training data required. This guide implements production LLM-NER, covers when classic models still win, and the evaluation you need either way.
LLM-NER vs classic NER: the real trade
Rules of thumb: LLM for custom schemas, low/medium volume, fast iteration; classic for standard entities at massive scale or when you need exact character spans (highlighting, redaction overlays). The hybrid below covers the rest.
Production implementation
```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NER_PROMPT = '''Extract entities from the text. Return JSON:
{"entities": [{"text": "",
"type": "PERSON" | "ORG" | "LOCATION" | "DATE" | "MONEY" | "PRODUCT",
"context": "<5-word surrounding snippet>"}]}
Rules:
"text" must be copied EXACTLY from the input (same casing/spacing).
Only the listed types. No interpretation: extract "next Tuesday" as-is, don't resolve it.
Empty list if none. JSON only.
Text: {input}'''


def extract_entities(text: str) -> list[dict]:
    resp = client.chat.completions.create(
        model='gpt-4o-mini',  # NER is mini-tier work
        response_format={'type': 'json_object'},
        # str.replace rather than str.format: the prompt contains literal JSON braces
        messages=[{'role': 'user', 'content': NER_PROMPT.replace('{input}', text)}],
    )
    ents = json.loads(resp.choices[0].message.content).get('entities', [])
    # The verification step that makes LLM-NER trustworthy:
    verified = []
    for e in ents:
        surface = e.get('text')
        if not surface:  # skip empty/missing strings; text.find('') would match at 0
            continue
        idx = text.find(surface)
        if idx >= 0:  # exact-match grounding
            verified.append({**e, 'start': idx, 'end': idx + len(surface)})
    return verified
```
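One caveat with the grounding step: `str.find` returns only the first occurrence, so an entity mentioned several times gets a single offset. The sketch below is one possible extension, not part of the original pipeline. The helper name `ground_all` is hypothetical; it assumes entity dicts shaped like the ones above and records every exact occurrence. It may also ground occurrences the model did not intend to tag, so treat it as a starting point.

```python
def ground_all(text: str, entities: list[dict]) -> list[dict]:
    """Attach offsets for every exact occurrence of each entity's text.

    Hypothetical helper: entities are dicts with at least 'text' and 'type'.
    """
    grounded = []
    seen = set()  # (start, end, type) triples already emitted
    for ent in entities:
        surface = ent.get('text', '')
        if not surface:
            continue  # an empty string would match everywhere
        start = text.find(surface)
        while start >= 0:
            end = start + len(surface)
            key = (start, end, ent.get('type'))
            if key not in seen:
                seen.add(key)
                grounded.append({**ent, 'start': start, 'end': end})
            # Resume the search one character further along.
            start = text.find(surface, start + 1)
    # Entities never found in the text are dropped, as in extract_entities.
    return sorted(grounded, key=lambda e: (e['start'], e['end']))
```

Sorting by offset keeps the output in reading order, which is convenient for highlighting or redaction passes that walk the text left to right.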
The design choices that matter:
text.find() verification: this kills the hallucinated-entity class entirely (anything not literally in the input gets dropped) *and* recovers character offsets, fixing LLM-NER's two classic weaknesses in one move. (Same string-grounding trick as contract analysis and OCR validation.)
Beyond extraction: linking and normalization
Real pipelines rarely stop at spans:
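As one illustration of a post-extraction step, here is a minimal normalization sketch that groups surface variants of the same mention under a canonical key. The function names `canonical_key` and `group_mentions` are hypothetical, and the key (Unicode NFKC, case folding, collapsed whitespace) is an assumption for illustration only. Real entity linking against a knowledge base would need considerably more than this.

```python
import re
import unicodedata
from collections import defaultdict


def canonical_key(surface: str) -> str:
    """Fold case, Unicode compatibility forms, and runs of whitespace."""
    folded = unicodedata.normalize('NFKC', surface).casefold()
    return re.sub(r'\s+', ' ', folded).strip()


def group_mentions(entities: list[dict]) -> dict:
    """Group entity dicts by (type, canonical key).

    Assumes each dict has 'text' and 'type', as returned by extract_entities.
    """
    groups = defaultdict(list)
    for ent in entities:
        groups[(ent['type'], canonical_key(ent['text']))].append(ent)
    return dict(groups)
```

Keying on the type as well as the text keeps "Acme Corp" the organization separate from "Acme Corp" tagged as a product, so a type disagreement stays visible rather than being merged away.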
Evaluation (non-negotiable, 2 hours)
Label 100 representative documents (your real data, including the ugly ones). Score precision/recall per entity type — not overall (an aggregate hides "MONEY is 99%, PRODUCT is 60%"). Re-run on every prompt/model change (eval discipline). Typical findings: standard types land high-90s precision after the verification step; custom types start lower and respond to few-shot examples.
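The per-type scoring can be sketched as follows. This assumes gold and predicted entities are represented as `(doc_id, start, end, type)` tuples and uses strict matching (exact span and type); both the representation and the name `per_type_scores` are assumptions for illustration. A lenient, overlap-based match is a reasonable alternative if your labels have fuzzy boundaries.

```python
from collections import defaultdict


def per_type_scores(gold, pred) -> dict:
    """Strict-match precision/recall per entity type.

    gold, pred: iterables of (doc_id, start, end, type) tuples.
    """
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})
    for item in pred:
        counts[item[3]]['tp' if item in gold else 'fp'] += 1
    for item in gold - pred:
        counts[item[3]]['fn'] += 1

    scores = {}
    for etype, c in counts.items():
        predicted = c['tp'] + c['fp']
        actual = c['tp'] + c['fn']
        scores[etype] = {
            'precision': c['tp'] / predicted if predicted else 0.0,
            'recall': c['tp'] / actual if actual else 0.0,
            'support': actual,  # number of gold entities of this type
        }
    return scores
```

Reporting `support` next to each score helps when a type has only a handful of gold examples, where a single miss moves the percentage a lot.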
FAQ
GLiNER-style compact zero-shot NER models? A genuinely good middle option: zero-shot custom types like an LLM, runs local/cheap like classic — worth benchmarking when volume grows but schemas keep changing.
Multilingual? LLM-NER works across major languages in one prompt — verify the exact-substring rule survives tokenization quirks per language with your eval set.
PII redaction use case? This pipeline + offsets = redaction; for compliance-grade PII specifically, pair with a dedicated detector (Presidio) — two independent systems disagree usefully (privacy patterns).
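To make the "pipeline + offsets = redaction" point concrete, here is a minimal sketch that replaces verified spans with type placeholders. The function name `redact`, the placeholder format, and the default of redacting only `PERSON` are assumptions for illustration; it expects entity dicts with `start`, `end`, and `type` keys like those produced by the verification step above, and it is not a substitute for a dedicated PII detector.

```python
def redact(text: str, entities: list[dict], types=('PERSON',)) -> str:
    """Replace each selected entity span with a [TYPE] placeholder."""
    spans = sorted(
        (e['start'], e['end'], e['type']) for e in entities if e['type'] in types
    )
    out, cursor = [], 0
    for start, end, etype in spans:
        if end <= cursor:
            continue  # fully inside a span that was already redacted
        if start < cursor:
            cursor = end  # partial overlap: extend the previous redaction
            continue
        out.append(text[cursor:start])  # untouched text before the span
        out.append(f'[{etype}]')
        cursor = end
    out.append(text[cursor:])  # remainder after the last span
    return ''.join(out)
```

Building the output left to right from sorted spans avoids the offset drift that comes from replacing spans in place one at a time.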
*Last updated: June 2026.*