AI SaaS Architecture Patterns

Common architecture patterns for AI SaaS applications

AI SaaS products share a recognizable architecture once you've seen a few — and the failures are equally patterned: margin death by token costs, one tenant's data in another's answer, a provider outage taking the product down. This guide catalogs the patterns that hold up, layer by layer.

The reference architecture

```text
Clients (web/API)
   │
API layer ──── auth, rate limits, usage metering (per-tenant!)
   │
Orchestration service ── prompts, routing, tools, guardrails
   │                       │
AI gateway ── providers    ├── Retrieval layer (per-tenant stores)
(fallback, cost routing)   ├── Job queue (async/batch AI work)
                           └── Eval + observability spine
```

Three structural decisions define the rest:

1. The gateway seam. All model traffic flows through one layer that owns provider routing, fallback, caching, and *per-tenant cost attribution*. Without this seam, adding a second provider, negotiating prices, and answering "which customer is burning our margin?" each require a rewrite.

2. Sync vs async split. Interactive features stream their responses (typically over server-sent events); everything that can wait goes through a queue, ideally feeding provider batch APIs at 50% off. The classic startup mistake is running bulk work through the interactive path: it costs 2× as much and competes with live users for rate limits and latency. Workers follow the same discipline as webhook processors: idempotent, backpressured, and shutdown-safe.

3. The config/registry seam. Prompts, model choices, and parameters resolve from a versioned registry, not code — prompt iteration decouples from deploys, rollback is a pointer flip, and every response logs its config version.
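The gateway seam from item 1 can be sketched as a single entry point that every model call passes through: it tries providers in order, falls back on failure, and attributes cost to the tenant in the same place. This is a minimal sketch, not a production gateway. The provider callables, `ProviderError`, and the per-1K-token prices are hypothetical stand-ins; real provider SDKs, their error types, and actual pricing are not shown.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical provider adapter: prompt -> (text, input_tokens, output_tokens).
ProviderFn = Callable[[str], tuple[str, int, int]]


class ProviderError(Exception):
    """Assumed convention: adapters raise this on outages or rate limits."""


@dataclass
class Route:
    name: str
    call: ProviderFn
    usd_per_1k_in: float   # illustrative prices, not real list prices
    usd_per_1k_out: float


@dataclass
class Gateway:
    routes: list[Route]  # tried in order: primary first, then fallbacks
    spend: dict[str, float] = field(default_factory=dict)  # tenant_id -> USD

    def complete(self, tenant_id: str, prompt: str) -> str:
        last_error: Exception | None = None
        for route in self.routes:
            try:
                text, tok_in, tok_out = route.call(prompt)
            except ProviderError as exc:
                last_error = exc  # fall through to the next provider
                continue
            cost = (tok_in / 1000 * route.usd_per_1k_in
                    + tok_out / 1000 * route.usd_per_1k_out)
            # Attribute cost at the one layer all model traffic passes through.
            self.spend[tenant_id] = self.spend.get(tenant_id, 0.0) + cost
            return text
        raise RuntimeError("all providers failed") from last_error
```

Because routing and attribution live in one class, adding a provider is a new `Route`, and "which tenant is burning our margin?" is a lookup in `spend` rather than a log-scraping exercise.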
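The worker discipline from item 2 (idempotent, backpressured, shutdown-safe) can be illustrated with an asyncio sketch. The in-memory `done` set stands in for what would need to be a durable idempotency store, and `handler` is a hypothetical coroutine that performs the AI job; a real system would sit on a persistent queue rather than `asyncio.Queue`.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


class Worker:
    """Sketch of an async AI job worker; storage and retry policy are assumptions."""

    def __init__(self, handler: Callable[[str, str], Awaitable[None]], max_pending: int = 100):
        # Bounded queue: producers wait once max_pending jobs are queued (backpressure).
        self.queue: asyncio.Queue[tuple[str, str]] = asyncio.Queue(maxsize=max_pending)
        self.handler = handler
        self.done: set[str] = set()              # stand-in for a durable idempotency store
        self.failed: dict[str, Exception] = {}

    async def submit(self, job_id: str, payload: str) -> None:
        await self.queue.put((job_id, payload))  # blocks the producer when the queue is full

    async def run(self) -> None:
        while True:
            job_id, payload = await self.queue.get()
            try:
                if job_id not in self.done:      # duplicate deliveries become no-ops
                    await self.handler(job_id, payload)
                    self.done.add(job_id)
            except Exception as exc:             # keep the loop alive; real code would retry or dead-letter
                self.failed[job_id] = exc
            finally:
                self.queue.task_done()

    async def drain_and_stop(self, task: asyncio.Task) -> None:
        # Shutdown-safe: finish every job already accepted, then stop the loop.
        await self.queue.join()
        task.cancel()
```

A producer that submits faster than the worker drains simply slows down, instead of piling unbounded work onto the provider.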
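The registry seam from item 3 amounts to immutable published versions plus a movable "live" pointer per feature. The sketch below keeps everything in memory; a real registry would be a database or config service, and the feature name, model name, and templates are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PromptConfig:
    version: str
    model: str
    template: str
    temperature: float


class Registry:
    """In-memory sketch of a versioned prompt/model registry."""

    def __init__(self) -> None:
        self._versions: dict[str, dict[str, PromptConfig]] = {}  # feature -> version -> config
        self._live: dict[str, str] = {}                          # feature -> live version

    def publish(self, feature: str, cfg: PromptConfig) -> None:
        versions = self._versions.setdefault(feature, {})
        if cfg.version in versions:
            raise ValueError(f"{feature}@{cfg.version} already published")  # versions are immutable
        versions[cfg.version] = cfg

    def promote(self, feature: str, version: str) -> None:
        if version not in self._versions.get(feature, {}):
            raise KeyError(f"{feature}@{version} not published")
        self._live[feature] = version  # rollback is the same call with an older version

    def resolve(self, feature: str) -> PromptConfig:
        return self._versions[feature][self._live[feature]]
```

Each request resolves its config at call time and can log `cfg.version` alongside the response, so a bad prompt change is traced and reverted without a deploy.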

Multi-tenancy: the part you cannot retrofit

Tenant isolation in AI SaaS has three layers, each with a failure story behind it:

Retrieval isolation: at minimum, a tenant_id filter on every vector query (with pgvector, a WHERE clause or Postgres row-level security on the embeddings table); separate collections or schemas for enterprise tiers; and *explicit tests for cross-tenant leakage*, because leakage is the AI-SaaS equivalent of an auth bypass.

Context isolation: prompts assembled per-request from per-tenant config; never cache assembled prompts across tenants; tenant data never enters shared few-shot examples.

Cost isolation: per-tenant token metering at the gateway, used for billing, for abuse caps, and for noticing that one tenant's workload is 40% of your inference bill (there is almost always such a tenant).
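Retrieval isolation and the leakage test it calls for can be sketched with an in-memory store whose search method makes the tenant filter mandatory and applies it before ranking. This is a stand-in, not a vector database client: the `TenantScopedStore` class, the brute-force cosine ranking, and the sample tenants are hypothetical. With pgvector the equivalent would be a tenant_id predicate in the same query that orders by vector distance.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Chunk:
    tenant_id: str
    text: str
    vector: list[float]


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantScopedStore:
    """In-memory stand-in for a vector store; there is no query path without a tenant."""

    def __init__(self) -> None:
        self._chunks: list[Chunk] = []

    def add(self, chunk: Chunk) -> None:
        self._chunks.append(chunk)

    def search(self, tenant_id: str, query: list[float], k: int = 3) -> list[Chunk]:
        if not tenant_id:
            raise ValueError("tenant_id is required for every query")
        # Filter first, then rank: other tenants' chunks never become candidates.
        candidates = [c for c in self._chunks if c.tenant_id == tenant_id]
        candidates.sort(key=lambda c: _cosine(query, c.vector), reverse=True)
        return candidates[:k]


def test_no_cross_tenant_leakage() -> None:
    store = TenantScopedStore()
    store.add(Chunk("acme", "acme pricing sheet", [1.0, 0.0]))
    store.add(Chunk("globex", "globex roadmap", [1.0, 0.0]))  # identical vector on purpose
    results = store.search("acme", [1.0, 0.0], k=10)
    assert [c.text for c in results] == ["acme pricing sheet"]
```

The leakage test deliberately gives two tenants the same embedding, so a missing filter would surface the other tenant's chunk as an equally good match.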
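For context isolation, one concrete safeguard is to make the tenant part of every cache key for assembled prompts or responses, so an entry built for one tenant can never be served to another. The helper below is a hypothetical sketch; the config version is included on the assumption that a prompt change should also invalidate cached entries.

```python
import hashlib
import json


def prompt_cache_key(tenant_id: str, config_version: str, user_input: str) -> str:
    """Hypothetical cache-key helper: tenant_id is always part of the key."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    # JSON-encoding the list avoids delimiter ambiguity ("a|b" + "c" vs "a" + "b|c").
    material = json.dumps([tenant_id, config_version, user_input], ensure_ascii=False)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two tenants asking the identical question get different keys, which is exactly the property the "never cache assembled prompts across tenants" rule needs.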
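The three uses of cost isolation (billing, abuse caps, spotting the dominant tenant) can all read from one per-tenant ledger. This is a minimal in-memory sketch; the `UsageMeter` class, the monthly cap values, and the idea that the gateway calls `check()` before and `record()` after each completion are assumptions for illustration.

```python
from __future__ import annotations

from collections import defaultdict


class UsageMeter:
    """Per-tenant cost ledger with optional monthly caps (illustrative sketch)."""

    def __init__(self, monthly_cap_usd: dict[str, float]) -> None:
        self.cap = monthly_cap_usd
        self.spend: defaultdict[str, float] = defaultdict(float)

    def check(self, tenant_id: str) -> bool:
        """Abuse cap: call before dispatching; tenants without a cap are unlimited."""
        return self.spend.get(tenant_id, 0.0) < self.cap.get(tenant_id, float("inf"))

    def record(self, tenant_id: str, cost_usd: float) -> None:
        self.spend[tenant_id] += cost_usd

    def share_of_bill(self) -> dict[str, float]:
        """Fraction of total inference spend per tenant (finds the 40% tenant)."""
        total = sum(self.spend.values())
        return {t: s / total for t, s in self.spend.items()} if total else {}
```

The same `spend` figures feed invoices, while `share_of_bill()` turns "one tenant is most of our bill" from a surprise into a dashboard number.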

Enterprise tier additions: tenant-pinned data residency, zero-retention provider routes, data processing agreements (DPAs), and bring-your-own-key (their API key, your software). All of these are architecturally cheap *if* the gateway seam exists.
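Why these additions are cheap behind a gateway can be shown with a small route resolver: residency, zero retention, and bring-your-own-key become per-tenant policy checks at routing time. Everything here is a hypothetical sketch: the `TenantPolicy` fields, the route names, and the key-reference strings are illustrative, and the resolver stores only a reference to a secret, never the key itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    region: str = "us"               # data residency pin
    zero_retention: bool = False     # require a zero-retention provider route
    byok_key_ref: str | None = None  # reference into a secret store, never the raw key


@dataclass(frozen=True)
class ProviderRoute:
    name: str
    region: str
    zero_retention: bool


def resolve_route(policy: TenantPolicy, routes: list[ProviderRoute]) -> tuple[ProviderRoute, str]:
    """Return the first route satisfying the policy, plus the credential reference to use."""
    for route in routes:
        if route.region != policy.region:
            continue
        if policy.zero_retention and not route.zero_retention:
            continue
        credential = policy.byok_key_ref or f"platform-key/{route.name}"
        return route, credential
    # Fail closed: never fall back to a route that violates the tenant's policy.
    raise LookupError("no provider route satisfies this tenant's policy")
```

Failing closed is the deliberate choice here: an enterprise tenant would rather see an error than have data leave its pinned region.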

The margin layer (what makes AI SaaS economics work)

Token costs are a COGS line that classic SaaS doesn't have; the patterns that protect gross margin, in impact order:

Tier routing: mini-tier models for the 70% of calls that are classification/extraction/simple-chat; frontier only where evals prove the delta matters.

Caching: response caching on repeated questions; cache-friendly prompt structure (stable prefix first) for provider-side prompt-caching discounts.

Right-sized context: retrieval top-k discipline and history summarization — context bloat is the silent margin killer.

Batching: batchable workloads (embeddings, enrichment, reports) go through batch APIs rather than the interactive path.

Usage-based pricing alignment: if your costs scale per-token but you price per-seat, one power user inverts your margin — meter and tier the product accordingly.
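Tier routing can be reduced to a lookup against eval results: default to the cheap tier and escalate only where the measured quality gap justifies it. This is a sketch under stated assumptions: the model names are placeholders, `eval_delta` is assumed to hold (frontier score minus small-tier score) per task type from your own eval sets, and the 0.02 threshold is arbitrary.

```python
from __future__ import annotations

SMALL_TIER = "mini-model"         # hypothetical model identifiers
FRONTIER_TIER = "frontier-model"


def pick_model(task_type: str, eval_delta: dict[str, float], min_delta: float = 0.02) -> str:
    """Use the frontier tier only where evals show it beats the small tier by > min_delta."""
    delta = eval_delta.get(task_type, 0.0)  # unmeasured task types stay on the cheap tier
    return FRONTIER_TIER if delta > min_delta else SMALL_TIER
```

Keeping the routing table data-driven means re-running evals after a model release can move task types between tiers without code changes.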
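The caching item combines two ideas: an application-side response cache for repeated questions, and prompt layout that puts stable content first so provider-side prompt caching, which generally keys on an identical leading prefix, can apply. The sketch below is illustrative only: `STABLE_PREFIX` is hypothetical text, exact-match normalization is a simplification (semantic caching is a separate technique), and the clock is injectable so the TTL can be tested.

```python
from __future__ import annotations

import time
from typing import Callable

# Hypothetical system text; it must be byte-identical across requests to share a prefix.
STABLE_PREFIX = (
    "You are a support assistant. Answer only from the provided context.\n"
    "Keep answers short and do not speculate.\n"
)


def build_prompt(context: str, question: str) -> str:
    # Stable part first, per-request parts last.
    return f"{STABLE_PREFIX}\nContext:\n{context}\n\nQuestion: {question}"


class ResponseCache:
    """Exact-match response cache for repeated questions, scoped per tenant, with a TTL."""

    def __init__(self, ttl_s: float = 3600.0, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._store: dict[tuple[str, str], tuple[float, str]] = {}

    @staticmethod
    def _normalize(question: str) -> str:
        return " ".join(question.lower().split())  # case- and whitespace-insensitive

    def get_or_compute(self, tenant_id: str, question: str, compute: Callable[[str], str]) -> str:
        key = (tenant_id, self._normalize(question))
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl_s:
            return hit[1]
        answer = compute(question)  # the model call only happens on a miss
        self._store[key] = (now, answer)
        return answer
```

The cache key includes the tenant for the same isolation reason as above; a cross-tenant hit would be a data leak, not a saving.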
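Right-sized context can be enforced with a token budget: keep the top-k retrieved chunks, keep the newest conversation turns that fit, and summarize the rest. This is a sketch with loud assumptions: `approx_tokens` is a crude 4-characters-per-token heuristic (a real system would use the provider's tokenizer), `summarize` is a hypothetical callable (for example a cheap-tier model call), and its output is assumed to fit within `summary_reserve`.

```python
from __future__ import annotations

from typing import Callable


def approx_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token for English text).
    return max(1, len(text) // 4)


def fit_context(
    chunks: list[str],                        # retrieved chunks, best first
    history: list[str],                       # conversation turns, oldest first
    budget: int,
    summarize: Callable[[list[str]], str],    # hypothetical summarizer
    top_k: int = 4,
    summary_reserve: int = 100,               # tokens set aside for the summary
) -> tuple[list[str], list[str]]:
    kept_chunks = chunks[:top_k]              # top-k discipline
    used = sum(approx_tokens(c) for c in kept_chunks)
    limit = budget - summary_reserve
    kept_history: list[str] = []
    for turn in reversed(history):            # newest turns first
        cost = approx_tokens(turn)
        if used + cost > limit:
            break
        kept_history.insert(0, turn)
        used += cost
    older = history[: len(history) - len(kept_history)]
    if older:
        kept_history.insert(0, summarize(older))  # compress what no longer fits verbatim
    return kept_chunks, kept_history
```

Context length then stays roughly flat as conversations grow, instead of growing with every turn and every extra retrieved chunk.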
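The per-seat margin inversion is plain arithmetic. The figures used below are illustrative, not benchmarks: a $30/month seat and a blended inference cost of $5 per million tokens, counting inference as the only COGS.

```python
def seat_gross_margin(seat_price_usd: float, monthly_tokens: int, usd_per_million_tokens: float) -> float:
    """Gross margin on one seat, counting only inference as cost of goods sold."""
    cogs = monthly_tokens / 1_000_000 * usd_per_million_tokens
    return (seat_price_usd - cogs) / seat_price_usd


def break_even_tokens(seat_price_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which one seat's inference cost equals its price."""
    return seat_price_usd / usd_per_million_tokens * 1_000_000
```

With those assumed numbers, a user consuming 1M tokens a month leaves a margin of about 83%, the break-even point is 6M tokens, and a power user at 10M tokens turns the seat into a loss of roughly 67% of its price. Metering and usage tiers move that break-even point into the pricing model instead of leaving it to chance.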

The trust layer

Guardrails at boundaries: schema validation on all outputs, injection-aware input handling, approval gates on consequential actions.

The eval spine: per-feature eval sets that gate every prompt or model change; canary releases for anything user-visible.

Observability: traces, cost, and quality metrics per tenant and per feature. "Which tenants saw degraded answers yesterday?" must be a query, not an investigation.
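"Guardrails at boundaries" for outputs means parsing model output against a schema and rejecting anything malformed, then routing consequential actions through an approval gate. The sketch uses only the standard library to stay self-contained (a schema library such as Pydantic is a common alternative); the refund feature, its fields, and the $100 threshold are hypothetical. Injection-aware input handling is not covered here.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"refund", "deny"}
APPROVAL_THRESHOLD_USD = 100.0  # illustrative policy value


@dataclass(frozen=True)
class RefundDecision:  # hypothetical structured output for a refund feature
    action: str
    amount_usd: float
    reason: str


def parse_decision(raw: str) -> RefundDecision:
    """Validate model output; malformed output is rejected, never silently repaired."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"action", "amount_usd", "reason"}:
        raise ValueError("unexpected shape")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action {data['action']!r}")
    amount = data["amount_usd"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
        raise ValueError("amount_usd must be a non-negative number")
    if not isinstance(data["reason"], str):
        raise ValueError("reason must be a string")
    return RefundDecision(data["action"], float(amount), data["reason"])


def needs_human_approval(decision: RefundDecision) -> bool:
    # Approval gate: consequential actions above the threshold wait for a person.
    return decision.action == "refund" and decision.amount_usd > APPROVAL_THRESHOLD_USD
```

An allow-list of actions means a prompt-injected "delete_account" never reaches execution, even if the model emits it.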
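The eval spine needs two mechanical pieces: a gate that blocks a prompt or model change when its score on the feature's eval set regresses, and deterministic canary bucketing so the same tenant consistently sees the same variant. This is a sketch; the per-example scores, the `max_drop` tolerance, and hashing on feature plus tenant are assumptions rather than a prescribed method.

```python
from __future__ import annotations

import hashlib
from statistics import mean


def eval_gate(baseline: list[float], candidate: list[float], max_drop: float = 0.01) -> bool:
    """Pass only if the candidate's mean score on the same eval set drops by at most max_drop."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("scores must cover the same non-empty eval set")
    return mean(candidate) >= mean(baseline) - max_drop


def in_canary(tenant_id: str, feature: str, percent: int) -> bool:
    """Deterministic bucketing: a tenant stays in or out of a feature's canary across requests."""
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent
```

Hashing rather than random sampling keeps each tenant's experience stable during a rollout, and including the feature in the hash avoids putting the same tenants in every canary.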
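Making "which tenants saw degraded answers yesterday?" a query requires trace records that carry tenant, feature, a quality signal, cost, and config version. The sketch below runs that query over in-memory records; the `TraceRecord` fields, the assumption that `quality` is an online eval or judge score in [0, 1], and the 0.8 threshold are all illustrative.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TraceRecord:
    day: date
    tenant_id: str
    feature: str
    quality: float        # assumed online eval / judge score in [0, 1]
    cost_usd: float
    config_version: str


def degraded_tenants(records: list[TraceRecord], day: date,
                     min_quality: float = 0.8) -> dict[tuple[str, str], float]:
    """Return {(tenant_id, feature): mean quality} for pairs below min_quality on `day`."""
    scores: defaultdict[tuple[str, str], list[float]] = defaultdict(list)
    for r in records:
        if r.day == day:
            scores[(r.tenant_id, r.feature)].append(r.quality)
    means = {key: sum(v) / len(v) for key, v in scores.items()}
    return {key: m for key, m in means.items() if m < min_quality}
```

In production this would be a GROUP BY over a trace store, but the shape is the same; because each record also carries `cost_usd` and `config_version`, the follow-up questions (what did it cost, which config was live) come from the same rows.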

Build order for a new AI SaaS

Week 1: skeleton, gateway seam, and one feature streaming end-to-end. Weeks 2-3: registry, eval set, and cost metering (before customers, not after). Then features. The seams cost days early and rewrites late; that's the whole pattern catalog in one sentence.

*Last updated: June 2026.*
