
AI Privacy & Data Protection: GDPR Compliance with Machine Learning in 2025

Navigate data privacy regulations while leveraging AI - practical compliance strategies

AI Privacy & Data Protection: GDPR-Compliant LLM Systems

AI thrives on data; privacy law demands minimization, purpose limitation, and deletability. Reconciling the two is now a *design* problem, not a legal afterthought, especially now that the EU AI Act is in force alongside GDPR and adds AI-specific obligations on top of data protection. This guide is the engineer's view: what the rules actually require of an LLM system, the architecture patterns that satisfy them, and the vendor questions that matter.

*(Engineering guidance, not legal advice — validate your specific processing with counsel/DPO.)*

The four GDPR pressure points for LLM systems

  • Lawful basis & purpose limitation — you need a defined basis (consent, contract, legitimate interest) *per processing purpose*. "We send support tickets to an LLM for drafting replies" is a purpose; quietly also using them for analytics is a second one needing its own basis.
  • Data minimization — send the model what the task needs, not the whole record. Most prompts carrying full customer objects violate this for convenience.
  • Data subject rights — access, rectification, erasure. The hard question: where does personal data end up? Prompts in provider logs, embeddings in your vector store, fine-tuned weights. Each needs a deletion story.
  • Transfers & processors — an LLM API call is processing by a processor, usually with cross-border transfer questions. You need a DPA with the provider, and the provider's retention/training policies become *your* compliance surface.

The AI Act layers on: risk classification of your use case (HR screening, credit, biometrics → high-risk obligations: documentation, human oversight, logging), transparency duties (users must know they're talking to AI), and prohibited practices. Map your feature against the risk tiers early; retrofitting high-risk controls is expensive.
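The first two pressure points (a lawful basis per purpose, and minimization) can be enforced at the point where a prompt is assembled. The following is a minimal sketch, not a definitive implementation: the purpose name, the field names, and the helper `build_model_context` are hypothetical, and the basis assigned in the registry is a placeholder rather than a legal determination.

```python
from dataclasses import dataclass
from enum import Enum


class LawfulBasis(Enum):
    # The three bases named in the list above; GDPR Art. 6 lists others as well.
    CONSENT = "consent"
    CONTRACT = "contract"
    LEGITIMATE_INTEREST = "legitimate_interest"


@dataclass(frozen=True)
class PurposeSpec:
    basis: LawfulBasis
    allowed_fields: frozenset[str]  # the only record fields this purpose may send


# Hypothetical registry: one entry per processing purpose.
PURPOSES: dict[str, PurposeSpec] = {
    "support_reply_drafting": PurposeSpec(
        basis=LawfulBasis.CONTRACT,  # placeholder, confirm with counsel/DPO
        allowed_fields=frozenset({"ticket_body", "product", "language"}),
    ),
}


class UndeclaredPurposeError(LookupError):
    """Raised when code tries to process data for a purpose nobody declared."""


def build_model_context(purpose: str, record: dict) -> dict:
    """Return only the fields the declared purpose is allowed to send to the model."""
    spec = PURPOSES.get(purpose)
    if spec is None:
        raise UndeclaredPurposeError(f"no lawful basis declared for purpose {purpose!r}")
    return {key: value for key, value in record.items() if key in spec.allowed_fields}


customer_record = {
    "ticket_body": "My invoice is wrong.",
    "product": "Pro plan",
    "email": "anna@example.com",
    "date_of_birth": "1990-01-01",
}
context = build_model_context("support_reply_drafting", customer_record)
```

With this shape, quietly reusing tickets for analytics fails loudly until someone adds an `analytics` entry with its own basis and field list.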

    Architecture patterns that make compliance tractable

    Pattern 1: PII redaction before the model

    Strip or pseudonymize identifiers before the prompt leaves your boundary:

    # Sketch: redact before, re-hydrate after
    from presidio_analyzer import AnalyzerEngine      # Microsoft Presidio: the standard OSS choice
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def redact(text: str) -> str:
        results = analyzer.analyze(text=text, language='en')  # finds names, emails, phones, IBANs...
        return anonymizer.anonymize(text=text, analyzer_results=results).text

    prompt = redact(f'Draft a reply to this ticket:\n{ticket_body}')
    # LLM never sees real identifiers; re-insert via your own mapping if needed

    This single pattern defuses most of minimization + processor-risk at once. NER-based redaction isn't perfect (names in odd formats slip through) — combine with deny-list rules for your known identifier formats, and measure leakage on a test set.
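The deny-list rules and the leakage measurement mentioned above might look like the sketch below. It uses only the standard library and is meant to sit alongside NER-based redaction, not replace it. The `CUST-` ID format, the placeholder syntax, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical deny-list: identifier formats you already know appear in your data.
DENY_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CUSTOMER_ID": re.compile(r"\bCUST-\d{6}\b"),  # assumed internal ID format
}


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace deny-list matches with placeholders; return the text and the mapping."""
    mapping: dict[str, str] = {}  # placeholder -> original value (keep inside your boundary)
    reverse: dict[str, str] = {}  # original value -> placeholder, so repeats reuse one token

    for label, pattern in DENY_PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            value = match.group(0)
            if value not in reverse:
                token = f"<{label}_{len(reverse) + 1}>"
                reverse[value] = token
                mapping[token] = value
            return reverse[value]

        text = pattern.sub(substitute, text)
    return text, mapping


def rehydrate(text: str, mapping: dict[str, str]) -> str:
    """Re-insert the original values into model output, after it is back inside your boundary."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


def leakage_rate(redacted_texts: list[str], known_identifiers: list[str]) -> float:
    """Fraction of known identifiers that still appear in any redacted text."""
    if not known_identifiers:
        return 0.0
    leaked = sum(any(ident in text for text in redacted_texts) for ident in known_identifiers)
    return leaked / len(known_identifiers)


ticket = "Contact anna@example.com about CUST-123456; anna@example.com prefers email."
redacted, mapping = pseudonymize(ticket)
```

Running `leakage_rate` over a labelled test set in CI turns "redaction isn't perfect" into a number you can track and gate on.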

    Pattern 2: Local/EU-resident inference for sensitive classes

    For special-category data (health, etc.) or strict residency: run open-weights models inside your boundary — Ollama for modest scale, vLLM for volume. Data never leaves; the processor problem disappears (you keep the controller obligations). Hybrid routing is the pragmatic norm: sensitive intents → local model, generic ones → cloud API (multi-provider routing).
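The hybrid routing described above can be sketched as a small decision function. The intent labels and backend names are hypothetical, and the fail-closed default (anything not explicitly marked generic stays local) is a design choice assumed here, not something the pattern requires.

```python
from dataclasses import dataclass

# Hypothetical intent labels; in practice they come from your own classifier or form metadata.
GENERIC_INTENTS = {"product_question", "shipping_status", "password_reset_help"}


@dataclass(frozen=True)
class Route:
    backend: str  # "local" (inside your boundary) or "cloud" (external API)
    reason: str


def choose_route(intent: str, contains_special_category: bool = False) -> Route:
    """Send a request to the cloud API only when it is explicitly known to be generic."""
    if contains_special_category:
        return Route("local", "special-category data detected")
    if intent in GENERIC_INTENTS:
        return Route("cloud", "intent is on the generic allow-list")
    # Fail closed: unknown or sensitive intents stay on the local model.
    return Route("local", "intent not on the generic allow-list")


route = choose_route("shipping_status")
```

Allow-listing the generic intents means a newly added intent is handled locally until someone deliberately classifies it.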

    Pattern 3: RAG with deletable stores instead of training on personal data

    Never fine-tune on personal data you may have to erase: weights have no delete button. Keep personal data in retrieval stores (pgvector) keyed to data subjects, so an erasure request is a DELETE FROM ... WHERE subject_id = ? over rows *and their embeddings*. Embeddings of personal data should be treated as personal data, so design the store so you can find them by subject.
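A minimal sketch of a subject-keyed store and its erasure path follows. It uses the standard library's `sqlite3` as a stand-in for Postgres with pgvector so that it runs anywhere; the table names, column names, and `erase_subject` helper are assumptions for illustration.

```python
import sqlite3


def create_store() -> sqlite3.Connection:
    """Create an in-memory store where every row and every embedding carries a subject_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            subject_id TEXT NOT NULL,
            body TEXT NOT NULL
        );
        CREATE TABLE embeddings (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            subject_id TEXT NOT NULL,
            vector BLOB NOT NULL
        );
        CREATE INDEX idx_documents_subject ON documents(subject_id);
        CREATE INDEX idx_embeddings_subject ON embeddings(subject_id);
        """
    )
    return conn


def erase_subject(conn: sqlite3.Connection, subject_id: str) -> dict[str, int]:
    """Delete a subject's rows and embeddings in one transaction; return the counts."""
    with conn:  # commits on success, rolls back if either DELETE fails
        embeddings = conn.execute(
            "DELETE FROM embeddings WHERE subject_id = ?", (subject_id,)
        ).rowcount
        documents = conn.execute(
            "DELETE FROM documents WHERE subject_id = ?", (subject_id,)
        ).rowcount
    return {"embeddings": embeddings, "documents": documents}


conn = create_store()
with conn:
    conn.executemany(
        "INSERT INTO documents (id, subject_id, body) VALUES (?, ?, ?)",
        [(1, "subject-1", "ticket A"), (2, "subject-1", "ticket B"), (3, "subject-2", "ticket C")],
    )
    conn.executemany(
        "INSERT INTO embeddings (document_id, subject_id, vector) VALUES (?, ?, ?)",
        [(1, "subject-1", b"\x00\x01"), (2, "subject-1", b"\x02\x03"), (3, "subject-2", b"\x04\x05")],
    )

deleted = erase_subject(conn, "subject-1")
```

The returned counts double as evidence for the erasure request. Caches and provider-side retention still need their own deletion paths.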

    Pattern 4: The audit spine

    Log per request: purpose tag, lawful-basis tag, model+version, redaction applied, and token usage — without logging the payload itself (or with strict TTL if you must for debugging). This one table answers DPIA questions, subject-access requests, and the AI Act's logging duties simultaneously.
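One possible shape for such an audit row is sketched below. The fields mirror the list above; `request_id` and `timestamp` are added assumptions, and the model name in the usage example is hypothetical. The function deliberately takes no prompt or output argument, so the payload cannot end up in the log by accident.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    timestamp: str          # UTC, ISO 8601
    purpose: str            # purpose tag
    lawful_basis: str       # lawful-basis tag
    model: str              # model name plus version
    redaction_applied: bool
    prompt_tokens: int
    completion_tokens: int


def audit_line(
    *,
    request_id: str,
    purpose: str,
    lawful_basis: str,
    model: str,
    redaction_applied: bool,
    prompt_tokens: int,
    completion_tokens: int,
) -> str:
    """Serialize one request's metadata as a JSON line; the payload itself is never passed in."""
    record = AuditRecord(
        request_id=request_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        purpose=purpose,
        lawful_basis=lawful_basis,
        model=model,
        redaction_applied=redaction_applied,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
    )
    return json.dumps(asdict(record), sort_keys=True)


line = audit_line(
    request_id="req-0001",
    purpose="support_reply_drafting",
    lawful_basis="contract",
    model="example-model@2025-01",  # hypothetical model identifier
    redaction_applied=True,
    prompt_tokens=412,
    completion_tokens=96,
)
```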

    Vendor due diligence: the five questions

  • Is API data used for training? (Major providers generally state that API traffic is not used for training by default; verify in the DPA, not the marketing page.)
  • Retention period for prompts/outputs, and is zero-retention available?
  • EU data residency option? Sub-processor list?
  • Will they sign your DPA / do they offer SCCs for transfers?
  • Certifications (SOC 2, ISO 27001) — table stakes for enterprise review.

    Implementation checklist

  • [ ] Data map: which personal data reaches which model, for which purpose, on which basis
  • [ ] DPIA for anything systematic/at-scale (and AI Act risk classification)
  • [ ] Redaction layer + leakage tests in CI
  • [ ] Erasure path covering DB rows, vector embeddings, caches, provider-side retention
  • [ ] Transparency UX: AI disclosure + human escalation path
  • [ ] Vendor DPAs filed; retention settings actually configured (zero-retention flags don't set themselves)

    FAQ

    Are embeddings personal data? If they are derivable from or linkable to a person, treat them as personal data. Regulator guidance trends that way; design for deletability.

    Can users opt out of AI processing? If your basis is legitimate interest, they can object — build the bypass path (human-only handling) before you need it.

    Does anonymization free me from GDPR? True anonymization does — but LLM-context "anonymization" is usually pseudonymization (re-identifiable), which stays in scope. Be honest about which you have.


    *Last updated: June 2026. Regulation and provider terms evolve — confirm against current official texts and your counsel.*
