
OpenAI API Best Practices: Production Guide

Production patterns for the OpenAI API including retries and rate limiting


The gap between "calls the OpenAI API" and "runs it in production" is a known checklist: timeout/retry behavior, structured outputs, cost instrumentation, and the failure modes you haven't hit yet. This guide is that checklist, with the code details that matter. (Most of it transfers to other providers with little change; the Claude API comparison covers the deltas.)

Client configuration: the defaults are not production values

python
from openai import OpenAI, AsyncOpenAI

client = AsyncOpenAI(
    timeout=30.0,   # default is much higher; bound your tail latency
    max_retries=3,  # SDK retries 429/5xx with backoff automatically
)

  • Use the async client in services: a sync call in an async web handler blocks the event loop (the full sync/async decision). Cap concurrency with a semaphore.
  • Size timeouts per route: 10-30s for interactive calls, longer only where streaming isn't possible.
  • Keep one client instance per process so connections are pooled; never construct a client per request. Load keys from the environment or a secrets manager.
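
The semaphore cap mentioned above can be sketched as follows. This is a minimal illustration, not the SDK's own mechanism: `BoundedCaller`, `fake_completion`, and the limit of 3 are hypothetical names and values, and `fake_completion` stands in for a real awaited API call so the sketch runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple, TypeVar

T = TypeVar("T")


class BoundedCaller:
    """Runs async callables with at most `limit` of them in flight at once."""

    def __init__(self, limit: int) -> None:
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, useful as a metric

    async def run(self, call: Callable[[], Awaitable[T]]) -> T:
        async with self._sem:  # waits here once `limit` calls are active
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await call()
            finally:
                self.in_flight -= 1


async def fake_completion(i: int) -> str:
    # Hypothetical stand-in for an awaited chat completion request.
    await asyncio.sleep(0.01)
    return f"reply-{i}"


async def main() -> Tuple[List[str], int]:
    caller = BoundedCaller(limit=3)  # assumed cap; tune to your rate-limit tier
    results = await asyncio.gather(
        *(caller.run(lambda i=i: fake_completion(i)) for i in range(10))
    )
    return list(results), caller.peak
```

Calling `asyncio.run(main())` issues ten requests but never lets more than three run concurrently. In a service, one `BoundedCaller` would typically live next to the shared client instance.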

Reliability patterns

  • Let the SDK retry transient errors; you handle the rest. 429/5xx → SDK backoff. Other 4xx (bad request, auth, not found) → your bug, don't retry. Content-filter outcomes → product decision, not retry.
  • Idempotency at your layer: LLM calls aren't idempotent (same input, different output). For "exactly once" semantics (billing-adjacent actions), gate on your own request IDs, not on retry behavior.
  • Cross-provider fallback for incidents: provider status pages show incidents roughly every month, which is justification enough. The gateway architecture makes it config, and fallback chains cover the request-level pattern.
  • Stream anything long, both for UX (time-to-first-token) and to stay clear of HTTP timeouts on big outputs (streaming recipe). Check finish_reason: a value of length means the output was truncated, which you must handle rather than treat as an answer.
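
Gating on your own request IDs might look like the sketch below. It is an in-memory illustration only: `IdempotencyGate`, `run_once`, and the `req-...` IDs are hypothetical, and a real service would need a durable store with an atomic insert (for example a unique constraint) instead of a dict.

```python
import threading
from typing import Callable, Dict, Tuple


class IdempotencyGate:
    """Runs a side effect at most once per request ID (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self._lock = threading.Lock()

    def run_once(self, request_id: str, action: Callable[[], str]) -> str:
        # The lock makes "check, run, record" atomic within this process.
        with self._lock:
            if request_id in self._results:
                return self._results[request_id]  # replay the stored result
            result = action()
            self._results[request_id] = result
            return result


def demo() -> Tuple[str, str, str, int]:
    gate = IdempotencyGate()
    sent = []

    def send_invoice() -> str:
        # Hypothetical billing-adjacent action triggered by an LLM result.
        sent.append("invoice")
        return f"invoice-{len(sent)}"

    first = gate.run_once("req-123", send_invoice)
    retry = gate.run_once("req-123", send_invoice)  # same ID: not re-run
    other = gate.run_once("req-456", send_invoice)  # new ID: runs
    return first, retry, other, len(sent)
```

The retried request gets the first result back and the invoice is sent only twice in total, regardless of how many times the LLM call behind it was retried.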

Structured outputs: use the real feature

Use schema-enforced structured outputs (not "please return JSON" prose, not legacy json-mode-and-pray):

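python
from pydantic import BaseModel

class Ticket(BaseModel):
    category: str
    urgency: str
    summary: str

async def triage(body: str):
    # `client` is the AsyncOpenAI instance configured earlier, so the call must be awaited
    resp = await client.chat.completions.parse(  # SDK validates against the schema
        model='gpt-5-mini',
        messages=[{'role': 'user', 'content': f'Triage this ticket: {body}'}],
        response_format=Ticket,
    )
    ticket = resp.choices[0].message.parsed  # a Ticket, or None if the model refused
    return ticket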

Schema enforcement guarantees *shape*, not *sense*: semantic validation (does this ID exist? is the math right?) stays on you (validation guide).
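
A semantic check layered on top of the parsed object could be sketched like this. The allowed categories and urgency levels are hypothetical, and the dataclass below is only a stdlib stand-in for the Pydantic `Ticket` model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_CATEGORIES = {"billing", "bug", "account"}  # hypothetical taxonomy
ALLOWED_URGENCY = {"low", "medium", "high"}         # hypothetical levels


@dataclass
class Ticket:
    # Stand-in for the Pydantic model; same three string fields.
    category: str
    urgency: str
    summary: str


def semantic_errors(ticket: Ticket) -> List[str]:
    """Return human-readable problems; an empty list means the ticket passes."""
    errors = []
    if ticket.category not in ALLOWED_CATEGORIES:
        errors.append(f"unknown category: {ticket.category!r}")
    if ticket.urgency not in ALLOWED_URGENCY:
        errors.append(f"unknown urgency: {ticket.urgency!r}")
    if not ticket.summary.strip():
        errors.append("empty summary")
    return errors
```

A schema-valid response such as `Ticket("spam", "urgent", " ")` still fails all three checks here, which is the kind of error schema enforcement alone will not catch.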

Cost engineering (the practices that actually move the bill)

  • Right-size the model per route — the single biggest lever. Classification/extraction on mini/nano tiers; frontier models only where evals prove they're needed.
  • Prompt caching: stable system prompt + tools first, volatile content last — repeated-prefix discounts are automatic but only if your prompt construction is cache-friendly (no timestamps/UUIDs early; same prefix discipline as self-hosted serving).
  • Batch API for anything that can wait — flat 50% off (when and how).
  • Instrument cost per feature from day one: log model, tokens (in/out/cached), latency, feature tag on every call. The bill spike you can't attribute is the one that hurts (observability options).
  • Cap max_tokens per route — runaway generation on a malformed prompt is a real cost incident class.
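
Per-feature instrumentation can start as small as the sketch below. `CallRecord`, `tokens_by_feature`, and the feature tags are hypothetical names; in practice the token counts would be copied from the usage data returned with each response, and the records would go to your logging or metrics pipeline rather than a list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CallRecord:
    feature: str        # hypothetical feature tag, e.g. "triage"
    model: str
    input_tokens: int
    cached_tokens: int  # portion of the input served from the prompt cache
    output_tokens: int
    latency_s: float


def tokens_by_feature(records: Iterable[CallRecord]) -> Dict[str, Dict[str, int]]:
    """Aggregate call counts and token totals per feature tag."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"calls": 0, "input": 0, "cached": 0, "output": 0}
    )
    for r in records:
        t = totals[r.feature]
        t["calls"] += 1
        t["input"] += r.input_tokens
        t["cached"] += r.cached_tokens
        t["output"] += r.output_tokens
    return dict(totals)


# Illustrative records only; the numbers are made up for the example.
SAMPLE = [
    CallRecord("triage", "gpt-5-mini", 1000, 800, 50, 0.4),
    CallRecord("triage", "gpt-5-mini", 1200, 800, 70, 0.5),
    CallRecord("summarize", "gpt-5", 3000, 0, 400, 2.1),
]
```

Multiplying these totals by your current per-token prices gives cost per feature, and the cached-to-input ratio shows whether prompt construction is actually cache-friendly.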

Security and correctness

  • Prompt injection is a matter of when, not if: any user/web content entering prompts can carry instructions. Separate system/user roles strictly, treat model output as untrusted for downstream actions, and gate consequential tool calls on validation or human approval.
  • PII discipline: redact before sending where feasible; know your retention settings (GDPR engineering).
  • Pin model versions in production (gpt-5-2026-xx style snapshots where offered) and re-run your eval before moving pins — silent model drift breaks tuned prompts (prompt sensitivity).
  • Evals as regression gates: a 100-case eval suite wired into CI is the difference between "we think it still works" and knowing (workflow).
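
Gating consequential tool calls, as described above, might be sketched like this. The tool names, the refund limit, and the `approve` callback are all hypothetical; the point is the shape: read-only tools pass, consequential tools need both argument validation and approval, and anything unknown is denied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

READ_ONLY_TOOLS = {"search_docs"}      # hypothetical safe tools
CONSEQUENTIAL_TOOLS = {"issue_refund"}  # hypothetical tools with side effects
MAX_REFUND = 100                        # hypothetical policy limit


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]  # model output: treat as untrusted input


def refund_args_valid(arguments: Dict[str, Any]) -> bool:
    amount = arguments.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return False
    return 0 < amount <= MAX_REFUND


def gate_tool_call(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Return "allow" or "deny" for a tool call proposed by the model."""
    if call.name in READ_ONLY_TOOLS:
        return "allow"
    if call.name in CONSEQUENTIAL_TOOLS:
        if not refund_args_valid(call.arguments):
            return "deny"  # validation failed, do not even ask a human
        return "allow" if approve(call) else "deny"
    return "deny"  # unknown tools are denied by default
```

Here `approve` stands for whatever approval step you use, such as a human review queue; it is only consulted after validation passes.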

FAQ

Chat Completions or Responses API? New builds: Responses (it's where new features land, and Assistants API users are migrating to it). Existing Chat Completions code keeps working, so migrate opportunistically.

Organization keys vs project keys? Project-scoped keys with per-project budgets/limits: blast-radius control when a key leaks or a service runs away.

Rate limit headroom? Watch the rate-limit headers, alert at sustained >70% of tier, and request raises before launches, not during them.
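
The headroom check can be sketched from response headers as below. The `x-ratelimit-limit-*` and `x-ratelimit-remaining-*` header names are the ones OpenAI documents at the time of writing, so verify them against the current docs; `HeadroomMonitor` and the window size are hypothetical, and "sustained" is interpreted here as every sample in a sliding window exceeding the threshold.

```python
from collections import deque
from typing import Deque, Mapping, Optional

ALERT_THRESHOLD = 0.70  # the 70% figure from the text


def utilization(headers: Mapping[str, str], kind: str = "requests") -> Optional[float]:
    """Fraction of the limit in use; `kind` is "requests" or "tokens"."""
    limit = headers.get(f"x-ratelimit-limit-{kind}")
    remaining = headers.get(f"x-ratelimit-remaining-{kind}")
    if limit is None or remaining is None:
        return None  # headers missing: no signal rather than a guess
    limit_n = int(limit)
    if limit_n <= 0:
        return None
    return 1.0 - int(remaining) / limit_n


class HeadroomMonitor:
    """Flags when every sample in the last `window` exceeds the threshold."""

    def __init__(self, window: int = 5, threshold: float = ALERT_THRESHOLD) -> None:
        self._samples: Deque[float] = deque(maxlen=window)
        self._threshold = threshold

    def observe(self, value: float) -> bool:
        self._samples.append(value)
        window_full = len(self._samples) == self._samples.maxlen
        return window_full and all(v > self._threshold for v in self._samples)
```

Feeding `utilization(...)` into `observe(...)` on each response gives a simple alert signal; a production setup would more likely export the utilization as a metric and alert from the monitoring system.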


*Last updated: June 2026. Parameter names and model tiers move; verify against platform.openai.com/docs.*
