LLM Security: Defending Against Jailbreaks and Prompt Injection Attacks
Constitutional prompts, output filtering, and layered defense strategies
LLM Security: Defending Against Jailbreaks and Prompt Injection Attacks
Constitutional prompts, output filtering, and layered defense strategies
Comprehensive security guide for LLM applications covering prompt injection defense, jailbreak resistance, output filtering, and building secure AI systems that resist adversarial manipulation.
LLM security requires layered defenses. Primary threats: 1) Prompt injection in RAG: malicious document content overrides instructions. Defense: use delimiters to clearly separate instructions from user/retrieved content, add instruction reminders after retrieved context, semantic similarity checks on retrieved content. 2) Direct jailbreaks: role-play, hypothetical framing, "forget previous instructions." Defense: system prompt hardening with explicit prohibition on overriding safety instructions, repetition of safety constraints, model fine-tuning for robustness. 3) Indirect injection: malicious content in external sources (emails, websites) that the agent processes. Defense: treat all external content as untrusted, implement content safety checks before and after LLM processing. 4) Data exfiltration: attempt to extract training data or sensitive context. Defense: never include sensitive data in prompts, output filtering for PII patterns. Layered defense architecture: Input validation (length, encoding, basic patterns) -> Context sanitization (escape special characters, mark as untrusted) -> LLM with hardened system prompt -> Output validation (content safety classifier) -> Response filtering (PII redaction, sensitive info detection). Security testing: automated red-teaming with HarmBench, Garak framework for LLM vulnerability scanning. Regular adversarial testing cadence. Incident response: log all inputs and outputs, implement rate limiting, have kill switch for model endpoints.