← Back to tutorials

OpenAI Batch vs Standard API: Side-by-Side Comparison

Cost and throughput tradeoffs in OpenAI API modes — comparing batch processing across openai and python

OpenAI Batch vs Standard API: Side-by-Side Comparison

One sentence: if the work doesn't need an answer within the hour, the Batch API does the same inference at half the price — the trade is synchronous seconds versus a completion window of up to 24 hours. For embeddings backfills, nightly classification, dataset generation, and bulk summarization, it's the easiest 50% cost cut in AI engineering.

At a glance

Standard APIBatch API

LatencySeconds, synchronousUp to 24h window (often much faster, not guaranteed) PriceList price50% off inputs and outputs (per OpenAI's pricing page) InterfaceOne HTTPS request per callUpload JSONL file of requests → poll → download results Rate limitsYour tier's RPM/TPMSeparate, much higher per-batch quotas (per-model caps apply) Streaming✅❌ Scale per submission1 requestUp to 50,000 requests / 200 MB per file

Two non-obvious wins beyond price: batch quotas are separate from your interactive rate limits (a backfill won't starve your production traffic), and the file-based flow forces idempotent, resumable job design that ad-hoc scripts usually lack.

The complete flow

python
import json, time
from openai import OpenAI

client = OpenAI()

1. Build the JSONL — one request per line, custom_id is how you join results back

with open('batch.jsonl', 'w') as f: for doc in documents: f.write(json.dumps({ 'custom_id': doc['id'], 'method': 'POST', 'url': '/v1/chat/completions', 'body': { 'model': 'gpt-4o-mini', 'messages': [{'role': 'user', 'content': f'Classify sentiment: {doc["text"]}'}], 'max_tokens': 8, }, }) + '\n')

2. Upload + create the batch

batch_file = client.files.create(file=open('batch.jsonl', 'rb'), purpose='batch') batch = client.batches.create( input_file_id=batch_file.id, endpoint='/v1/chat/completions', completion_window='24h', )

3. Poll (or just check back later — webhooks aren't provided)

while (batch := client.batches.retrieve(batch.id)).status not in ('completed', 'failed', 'expired'): time.sleep(60)

4. Download and join on custom_id

results = {} for line in client.files.content(batch.output_file_id).text.splitlines(): r = json.loads(line) results[r['custom_id']] = r['response']['body']['choices'][0]['message']['content']

Details that bite in practice:

  • custom_id is mandatory and your only join key — results come back in arbitrary order. Use your own stable IDs.
  • Partial failure is normal: per-line errors land in a separate error_file_id. Always check it and design the job to resubmit only the failed custom_ids.
  • expired can happen under load — unfinished requests are returned (and you're only billed for completed ones); resubmit the remainder.
  • Supported endpoints include chat completions, embeddings, and responses — embeddings backfills are arguably the killer app.
  • When each wins

    Standard: anything user-facing, agent/tool loops where call N+1 depends on call N, anything needing streaming.

    Batch: nightly/offline classification and tagging, embeddings for a whole corpus, synthetic-data and eval-set generation, re-summarizing an archive, scheduled report generation. The mental shift: if your code path is "cron job → loop over rows → call the API", that loop should almost certainly be a batch file instead — and the async-with-semaphore pattern (sync vs async calls) is only the right tool when you need those results *today*.

    The hybrid that serious pipelines converge on: batch for the bulk backfill, standard API for the trickle of new items that can't wait for the next nightly run.

    Same idea at other providers

    This pattern is industry-standard now: Anthropic's Message Batches API offers the same 50% discount with a similar submit-and-poll flow, and Gemini's batch mode likewise. If you're multi-provider, an abstraction like LiteLLM helps keep batch plumbing uniform — see LiteLLM vs Portkey for LLM gateways. Pricing and caps move — verify current numbers on each provider's pricing page before committing a budget forecast.

    FAQ

    Is prompt caching still relevant in batches? Provider-dependent; with a shared system prompt across 50K requests it matters — structure prompts stable-prefix-first regardless (same principle as in our KV cache deep dive).

    Can I cancel a running batch? Yes — client.batches.cancel(id); completed requests are billed, the rest stop.

    How fast in practice? Often well under the 24h window — but it's a window, not an SLA. If "usually 2 hours, occasionally 23" breaks your pipeline, you need the standard API.


    *Last updated: June 2026. Verify current pricing and limits against OpenAI's docs.*

    Also available in 中文.

    OpenAI Batch vs Standard API: Side-by-Side Comparison | AI Skill Navigation | AI Skill Navigation