AI Privacy & Data Protection: GDPR Compliance with Machine Learning in 2025
Navigate data privacy regulations while leveraging AI - practical compliance strategies
AI Privacy & Data Protection: GDPR-Compliant LLM Systems
AI thrives on data; privacy law demands minimization, purpose limitation, and deletability. Reconciling the two is now a *design* problem, not a legal afterthought, especially now that the EU AI Act has entered into force and applies alongside GDPR, adding AI-specific obligations on top of data protection. This guide is the engineer's view: what the rules actually require of an LLM system, the architecture patterns that satisfy them, and the vendor questions that matter.
*(Engineering guidance, not legal advice — validate your specific processing with counsel/DPO.)*
The four GDPR pressure points for LLM systems
The AI Act layers on: risk-classification of your use case (HR screening, credit, biometrics → high-risk obligations: documentation, human oversight, logging), transparency duties (users must know they're talking to AI), and prohibited practices. Map your feature against the risk tiers early — retrofitting high-risk controls is expensive.
Architecture patterns that make compliance tractable
Pattern 1: PII redaction before the model
Strip or pseudonymize identifiers before the prompt leaves your boundary:
```python
# Sketch: redact before, re-hydrate after
import presidio_analyzer
import presidio_anonymizer  # Microsoft Presidio: the standard OSS choice

analyzer = presidio_analyzer.AnalyzerEngine()
anonymizer = presidio_anonymizer.AnonymizerEngine()


def redact(text: str) -> str:
    # Finds names, emails, phones, IBANs...
    results = analyzer.analyze(text=text, language='en')
    return anonymizer.anonymize(text=text, analyzer_results=results).text


# Placeholder ticket text; in practice this comes from your ticketing system
ticket_body = 'Hi, this is Jane Doe (jane@example.com). My order is late.'
prompt = redact(f'Draft a reply to this ticket:\n{ticket_body}')
# The LLM never sees real identifiers; re-insert via your own mapping if needed
```
This single pattern defuses most of minimization + processor-risk at once. NER-based redaction isn't perfect (names in odd formats slip through) — combine with deny-list rules for your known identifier formats, and measure leakage on a test set.
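The deny-list and leakage-measurement advice can be sketched with the standard library alone. This is a minimal illustration, not a replacement for Presidio: the `CUST-` identifier format, the `DENY_LIST` patterns, and the helper names `pseudonymize`, `rehydrate`, and `leakage_rate` are all assumptions made up for the example. It also shows one way to keep the placeholder-to-original mapping needed for re-hydration.

```python
import re

# Hypothetical identifier formats; replace with the ones your systems actually use.
DENY_LIST = {
    "CUSTOMER_ID": re.compile(r"\bCUST-\d{6}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace deny-list matches with placeholders.

    Returns the redacted text and a placeholder -> original mapping.
    """
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    for label, pattern in DENY_LIST.items():
        def substitute(match: re.Match, label: str = label) -> str:
            value = match.group(0)
            if value not in seen:
                placeholder = f"<{label}_{len(mapping) + 1}>"
                seen[value] = placeholder
                mapping[placeholder] = value
            return seen[value]

        text = pattern.sub(substitute, text)
    return text, mapping


def rehydrate(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into a model response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


def leakage_rate(samples: list[tuple[str, list[str]]]) -> float:
    """Fraction of known identifiers that survive redaction on a labeled test set."""
    leaked = total = 0
    for text, identifiers in samples:
        redacted, _ = pseudonymize(text)
        for identifier in identifiers:
            total += 1
            leaked += identifier in redacted
    return leaked / total if total else 0.0
```

Keep the mapping inside your boundary and scoped to the request: it is exactly what makes the output pseudonymized rather than anonymized. Running `leakage_rate` over a labeled sample of real-looking tickets gives a number you can track as redaction rules change.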
Pattern 2: Local/EU-resident inference for sensitive classes
For special-category data (health, etc.) or strict residency: run open-weights models inside your boundary — Ollama for modest scale, vLLM for volume. Data never leaves; the processor problem disappears (you keep the controller obligations). Hybrid routing is the pragmatic norm: sensitive intents → local model, generic ones → cloud API (multi-provider routing).
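A hybrid router can be as small as one function. The sketch below assumes an upstream step has already produced an intent label and a special-category flag; the intent names, the `Route` type, and the allow-list approach are illustrative choices, not something prescribed by any regulation or library. It fails closed: only intents explicitly listed as generic may leave the boundary.

```python
from dataclasses import dataclass

# Hypothetical intent labels that are considered safe to send to a cloud API.
GENERIC_INTENTS = {"summarize_public_docs", "translate_ui_text"}


@dataclass(frozen=True)
class Route:
    backend: str  # "local" or "cloud"
    reason: str   # recorded so the decision can be audited later


def choose_route(intent: str, contains_special_category: bool) -> Route:
    """Decide where a request is served; anything not clearly generic stays local."""
    if contains_special_category:
        return Route("local", "special-category data detected")
    if intent in GENERIC_INTENTS:
        return Route("cloud", f"generic intent: {intent}")
    return Route("local", f"intent not on the cloud allow-list: {intent}")
```

An allow-list is used instead of a list of sensitive intents so that a new or misclassified intent defaults to the local model rather than the cloud.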
Pattern 3: RAG with deletable stores instead of training on personal data
Never fine-tune on personal data you may have to erase: weights have no delete button. Keep personal data in retrieval stores (pgvector) keyed to data subjects, so an erasure request is a DELETE FROM ... WHERE subject_id = ? over rows *and their embeddings*. Embeddings of personal data are personal data, so design the store so you can find them by subject.
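To make the erasure path concrete, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for Postgres with pgvector, so it runs anywhere. The `chunks` table, its columns, and the helper names are assumptions for illustration; in pgvector the `embedding` column would be a vector type instead of JSON text. The point is only that content and embedding live in the same row, keyed by `subject_id`.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chunks (
        id         INTEGER PRIMARY KEY,
        subject_id TEXT NOT NULL,  -- data subject this chunk belongs to
        content    TEXT NOT NULL,
        embedding  TEXT NOT NULL   -- JSON here; a vector column in pgvector
    )
    """
)
# Index by subject so erasure and access requests do not need a full scan.
conn.execute("CREATE INDEX idx_chunks_subject ON chunks (subject_id)")


def add_chunk(subject_id: str, content: str, embedding: list[float]) -> None:
    """Store a retrievable chunk together with its embedding."""
    with conn:
        conn.execute(
            "INSERT INTO chunks (subject_id, content, embedding) VALUES (?, ?, ?)",
            (subject_id, content, json.dumps(embedding)),
        )


def erase_subject(subject_id: str) -> int:
    """Delete every row (text and embedding) for one data subject; return the row count."""
    with conn:
        cursor = conn.execute(
            "DELETE FROM chunks WHERE subject_id = ?", (subject_id,)
        )
    return cursor.rowcount
```

Returning the deleted row count gives you something to record as evidence that the request was carried out. Backups, caches, and any derived indexes outside this table still need their own erasure path.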
Pattern 4: The audit spine
Log per request: purpose tag, lawful-basis tag, model+version, redaction applied, and token usage — without logging the payload itself (or with strict TTL if you must for debugging). This one table answers DPIA questions, subject-access requests, and the AI Act's logging duties simultaneously.
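One possible shape for that per-request record is sketched below. The field names, the `AuditRecord` type, and the `audit_line` helper are hypothetical; the fields mirror the list above, and the record deliberately has no slot for the prompt or the completion, so the payload cannot be logged by accident.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    timestamp: str          # UTC, ISO 8601
    purpose: str            # e.g. "support_reply"
    lawful_basis: str       # e.g. "contract", "legitimate_interest"
    model: str
    model_version: str
    redaction_applied: bool
    prompt_tokens: int
    completion_tokens: int


def audit_line(
    request_id: str,
    purpose: str,
    lawful_basis: str,
    model: str,
    model_version: str,
    redaction_applied: bool,
    prompt_tokens: int,
    completion_tokens: int,
) -> str:
    """Build one JSON log line with metadata only, never the payload."""
    record = AuditRecord(
        request_id=request_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        purpose=purpose,
        lawful_basis=lawful_basis,
        model=model,
        model_version=model_version,
        redaction_applied=redaction_applied,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Each line can be appended to a log stream or inserted as a table row. If payloads must be kept for debugging, store them separately with their own TTL and join on `request_id`.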
Vendor due diligence: the five questions
Implementation checklist
FAQ
Are embeddings personal data? If they are derived from or linkable to a person, treat them as personal data. Regulator guidance trends that way; design for deletability.
Can users opt out of AI processing? If your basis is legitimate interest, they can object — build the bypass path (human-only handling) before you need it.
Does anonymization free me from GDPR? True anonymization does — but LLM-context "anonymization" is usually pseudonymization (re-identifiable), which stays in scope. Be honest about which you have.
*Last updated: June 2026. Regulation and provider terms evolve — confirm against current official texts and your counsel.*