LLM Security: Defending Against Jailbreaks and Prompt Injection Attacks
Constitutional prompts, output filtering, and layered defense strategies
LLM security requires layered defenses. Primary threats: 1) Prompt injection in RAG: malicious document content overrides instructions. Defense: use delimiters to clearly separate instructions from user/retrieved content, add instruction reminders after retrieved context, semantic similarity checks on retrieved content. 2) Direct jailbreaks: role-play, hypothetical framing, "forget previous instructions." Defense: system prompt hardening with explicit prohibition on overriding safety instructions, repetition of safety constraints, model fine-tuning for robustness. 3) Indirect injection: malicious content in external sources (emails, websites) that the agent processes. Defense: treat all external content as untrusted, implement content safety checks before and after LLM processing. 4) Data exfiltration: attempt to extract training data or sensitive context. Defense: never include sensitive data in prompts, output filtering for PII patterns. Layered defense architecture: Input validation (length, encoding, basic patterns) -> Context sanitization (escape special characters, mark as untrusted) -> LLM with hardened system prompt -> Output validation (content safety classifier) -> Response filtering (PII redaction, sensitive info detection). Security testing: automated red-teaming with HarmBench, Garak framework for LLM vulnerability scanning. Regular adversarial testing cadence. Incident response: log all inputs and outputs, implement rate limiting, have kill switch for model endpoints.
Also available in 中文.