The AI System Design Interview: How to Answer
"Design a customer-support AI assistant" is now a standard senior-engineering interview question, and it's failable in a specific way: candidates who jump to "I'd use RAG with a vector database" fail; candidates who interrogate requirements, draw the boring boxes, and talk about evaluation and failure modes pass. This guide is the answer framework plus the two most common questions worked.
The framework (use it out loud)
Requirements first, AI second (5 min): Who uses it? Query volume? Latency tolerance? What's the cost of a wrong answer — annoyance or lawsuit? Accuracy bar and how it's measured? Data freshness needs? *The error-cost question is the senior signal* — it drives every later choice (model tier, human gates, eval rigor).
The non-AI skeleton (5 min): clients → API gateway → service layer → data stores → async workers. The AI is a component *inside* a normal system; interviewers fail people who draw "magic LLM box" architectures.
The AI layer, justified (15 min): model choice (and tier-routing), retrieval design, prompt/context strategy, structured outputs — every choice tied to a requirement from step 1.
Evaluation + operations (10 min): this is where offers are won — eval sets, regression gates, cost instrumentation, fallbacks, monitoring. Most candidates have nothing here; having *anything* differentiates.
Failure modes + iteration (5 min): hallucination handling, provider outage, prompt injection, drift. Close with "v1 ships X; we measure Y; v2 adds Z."
Worked example 1: support assistant
Requirements: 10K tickets/day, seconds-latency for chat, wrong answers cost trust (medium-high) → citations + escape hatches mandatory.
Skeleton: chat frontend → API → orchestration service → RAG over help docs (pgvector — say *why*: existing Postgres, filters, one backup story) + scoped account-lookup tools (identity from auth, not from the prompt) → ticket-creation handoff.
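The "identity from auth, not from the prompt" point can be sketched as a tool handler. This is a minimal illustration, not a prescribed design: `AuthContext`, `ACCOUNTS`, `ALLOWED_FIELDS`, and `account_lookup` are hypothetical names, and the in-memory dictionary stands in for a real account store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthContext:
    """Identity established by the auth layer before the model runs (hypothetical type)."""
    customer_id: str


# Stand-in for the real account store; the contents are made up for illustration.
ACCOUNTS = {
    "cust_1": {"plan": "pro", "open_tickets": 2},
    "cust_2": {"plan": "free", "open_tickets": 0},
}

# The tool is scoped: only these fields can ever be returned to the model.
ALLOWED_FIELDS = {"plan", "open_tickets"}


def account_lookup(auth: AuthContext, model_args: dict) -> dict:
    """Tool handler: the model chooses which field to read, never whose account."""
    field = model_args.get("field")
    if field not in ALLOWED_FIELDS:
        return {"error": "field not allowed"}
    # Any customer_id present in model_args is ignored; the session decides.
    account = ACCOUNTS[auth.customer_id]
    return {"field": field, "value": account[field]}
```

Even if injected text persuades the model to pass `"customer_id": "cust_2"`, a session authenticated as `cust_1` only ever reads `cust_1`'s data, because the handler never consults that argument.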
AI layer: mini-tier model for intent routing; mid-tier for answer synthesis with mandatory citations; structured outputs for any action; streaming UX (SSE).
Eval/ops: 200 historical tickets as the eval set, judged on groundedness + resolution; deflection *and* satisfaction metrics; provider fallback; escalate-to-human always visible.
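One way to turn that eval set into a regression gate is sketched below. It assumes each historical ticket has already been judged (by a human or a judge model, which is out of scope here) on the two criteria named above. `EvalResult`, `pass_rates`, `gate`, and the `max_drop=0.02` tolerance are illustrative choices, not values from any standard.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    ticket_id: str
    grounded: bool  # judged: the answer is supported by the cited docs
    resolved: bool  # judged: the answer would have resolved the ticket


def pass_rates(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate per-ticket judgments into one pass rate per metric."""
    n = len(results)
    if n == 0:
        raise ValueError("empty eval set")
    return {
        "groundedness": sum(r.grounded for r in results) / n,
        "resolution": sum(r.resolved for r in results) / n,
    }


def gate(baseline: dict[str, float], candidate: dict[str, float],
         max_drop: float = 0.02) -> list[str]:
    """Return the metrics that regressed by more than max_drop; empty means OK to ship."""
    return [m for m in baseline if baseline[m] - candidate[m] > max_drop]
```

In an interview, the point to make is that a prompt or model change is blocked whenever `gate` returns a non-empty list, and that the tolerance is a product decision tied to the error cost from the requirements step.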
Failure modes: injection via ticket text (data-fencing), stale docs (version-aware ingestion), confident wrong answers (citation requirement + sampling audits).
Worked example 2: "design an AI agent that does X"
The trap is agent-maximalism. The pass answer: start with a workflow and justify each increment of autonomy. Use a fixed pipeline if the steps are known (this is the case for not using an agent at all); add a planning loop only for genuinely dynamic tasks; put approval gates on consequential actions, tiered by risk; bound everything (steps, budget, time); and persist state with checkpointing so a crashed run can resume. Mention cost: an unbounded agent is a billing incident.
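The bounds, approval gate, and checkpointing described above fit in one small loop. The sketch below is written under stated assumptions: `AgentState`, `Action`, `run_agent`, the two-level `risk` field, and the default limits are all hypothetical, and `plan_next`, `execute`, and `approve` are callables the caller supplies (a planner, a tool runner, and a human or policy check).

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class AgentState:
    steps: int = 0
    spent_usd: float = 0.0
    history: list = field(default_factory=list)


@dataclass
class Action:
    name: str
    cost_usd: float
    risk: str = "low"   # "low" or "high"; real systems may use more tiers
    done: bool = False  # planner signals that the task is complete


def run_agent(plan_next: Callable[[AgentState], Action],
              execute: Callable[[Action], None],
              approve: Callable[[Action], bool],
              checkpoint: Path,
              max_steps: int = 10,
              max_usd: float = 1.0,
              max_seconds: float = 60.0) -> str:
    # Resume from the last checkpoint if one exists (crash recovery).
    if checkpoint.exists():
        state = AgentState(**json.loads(checkpoint.read_text()))
    else:
        state = AgentState()
    deadline = time.monotonic() + max_seconds  # time bound applies per run
    while True:
        if state.steps >= max_steps:
            return "stopped: step limit"
        if time.monotonic() > deadline:
            return "stopped: time limit"
        action = plan_next(state)
        if action.done:
            return "finished"
        if state.spent_usd + action.cost_usd > max_usd:
            return "stopped: budget limit"
        # Only consequential actions need sign-off.
        if action.risk == "high" and not approve(action):
            return "stopped: approval denied"
        execute(action)
        state.steps += 1
        state.spent_usd += action.cost_usd
        state.history.append(action.name)
        # Persist after every completed step so a restart resumes here.
        checkpoint.write_text(json.dumps(asdict(state)))
```

Every exit path returns a named reason, which is what makes the agent observable: "stopped: budget limit" in a log is a tuning signal, while a silent runaway loop is the billing incident.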
Rapid-fire depth questions (and the one-line strong answers)
*"How do you handle hallucination?"* → "Layered: grounding + citation requirements, schema validation, programmatic verification where possible (quoted text must string-match), human gates scaled to error cost — and an eval set that measures the residual rate."
*"Vector DB choice?"* → "pgvector until scale forces otherwise — operational simplicity beats benchmark deltas at most scales; the graduation thresholds are tens of millions of vectors or strict p99s."
*"How do you keep costs sane?"* → "Tier-route by task, cache-friendly prompt structure, batch the offline work, per-feature cost instrumentation from day one."
*"Fine-tune or RAG?"* → "RAG for knowledge (deletable, updatable, auditable); fine-tune for form (style/format at volume) — and never on data subject to deletion requests."
*"How do you ship a model/prompt change safely?"* → "Registry + eval gate + canary: same rigor as a code deploy."
What interviewers are actually scoring
Requirements interrogation (especially error cost), justified trade-offs over name-dropping, the evaluation story, and operational maturity (fallbacks, plus attention to details on the level of graceful shutdown). Practice by designing twice: once cold, and once after writing down five requirements. The second design is always different, and *that delta is the skill being tested*.
*Last updated: June 2026.*