Synchronous vs Async LLM Calls: Side-by-Side Comparison

Performance comparison for concurrent LLM operations — comparing throughput across asyncio and httpx

The rule in one line: one call at a time → synchronous is fine; many calls in flight → async wins, enormously. An LLM request is seconds of pure I/O wait. Sync code burns those seconds doing nothing; async code uses them to run other requests. With 100 calls at ~4s each, sequential sync takes ~400s, async with concurrency 10 takes ~40s. Same API bill, 10× less wall-clock.
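The arithmetic above can be reproduced with a back-of-the-envelope model. This is only a sketch: it assumes every call takes the same time and that the concurrency cap is the only limit, and `estimate_wall_clock` is a hypothetical helper, not part of any SDK.

```python
import math

def estimate_wall_clock(n_calls: int, latency_s: float, concurrency: int = 1) -> float:
    """Rough wall-clock estimate: calls complete in waves of `concurrency`.

    Assumes uniform latency and no rate limiting; real runs will vary.
    """
    waves = math.ceil(n_calls / concurrency)
    return waves * latency_s

sequential = estimate_wall_clock(100, 4.0)                  # one call at a time
concurrent = estimate_wall_clock(100, 4.0, concurrency=10)  # ten in flight
print(f"sync: ~{sequential:.0f}s, async x10: ~{concurrent:.0f}s")
# prints: sync: ~400s, async x10: ~40s
```

The token count, and therefore the bill, is identical in both cases; only the number of waves changes.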

What this is *not* about: making a single call faster. Async does nothing for one call's latency — the model takes as long as it takes.

The same task, both ways

```python
# Synchronous: fine for scripts, CLIs, one-at-a-time flows
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text: str) -> str:
    resp = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'Summarize in one line: {text}'}],
    )
    return resp.choices[0].message.content

# `documents` is assumed to be a list of strings defined elsewhere
results = [summarize(doc) for doc in documents]  # N * latency, sequential
```

```python
# Async: same work, concurrent
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
# Concurrency cap, see below. On Python < 3.10, create it inside main() instead
# so it binds to the loop that asyncio.run() starts.
sem = asyncio.Semaphore(10)

async def summarize(text: str) -> str:
    async with sem:  # at most 10 requests in flight
        resp = await client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': f'Summarize in one line: {text}'}],
        )
    return resp.choices[0].message.content

async def main():
    # `documents` is assumed to be a list of strings defined elsewhere;
    # gather returns results in input order
    return await asyncio.gather(*(summarize(d) for d in documents))

results = asyncio.run(main())  # ~N/10 * latency
```

Every major SDK ships both clients (OpenAI/AsyncOpenAI, Anthropic/AsyncAnthropic); in Node.js everything is async by default, so this decision mostly exists in Python.

The semaphore is not optional

asyncio.gather over 5,000 coroutines tries to fire 5,000 requests at once: you'll hit 429 rate limits almost immediately and may trip abuse detection. The semaphore caps in-flight requests. Size it from your rate limit: roughly (requests-per-minute limit / 60) × average request seconds, then set it somewhat lower to leave headroom. For example, a 600 requests-per-minute limit with ~4s requests supports about 40 in flight. Add retry-with-backoff for the 429s that still slip through (the SDKs retry automatically; tune max_retries).
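The sizing rule and the retry idea can be sketched in a few lines. Everything named here is an assumption for illustration: `max_in_flight`, its `headroom` safety factor, the `RateLimited` exception (a stand-in for whatever 429 error your SDK raises), and `with_backoff` are hypothetical helpers, not SDK APIs.

```python
import asyncio
import math
import random

def max_in_flight(rpm_limit: int, avg_request_s: float, headroom: float = 0.8) -> int:
    """In-flight requests ~= arrival rate (req/s) x seconds per request.

    `headroom` is an assumed safety factor that scales the result down.
    """
    return max(1, math.floor(rpm_limit / 60 * avg_request_s * headroom))


class RateLimited(Exception):
    """Hypothetical stand-in for the SDK's 429 error."""


async def with_backoff(make_call, max_attempts: int = 5, base_delay_s: float = 0.5):
    """Retry `make_call` (a zero-argument coroutine function) on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with full jitter: sleep 0..base * 2^attempt
            await asyncio.sleep(random.uniform(0, base_delay_s * 2 ** attempt))


# 600 RPM at ~4 s per request, no headroom
print(max_in_flight(600, 4.0, headroom=1.0))  # prints 40
```

In practice the SDK's built-in retries (configured through max_retries) cover most 429s; a hand-rolled loop like this mainly helps when you need retry behavior the client does not expose.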

Decision table

| Situation | Choice | Why |
| --- | --- | --- |
| CLI tool, notebook, one call per user action | Sync | Simplicity wins; nothing to overlap |
| Web backend handling concurrent users | Async | One sync LLM call blocks the worker for seconds; in FastAPI use async def routes + async client |
| Batch-processing N documents now | Async + semaphore | The 10× wall-clock win above |
| Batch processing that can wait hours | Neither: use the Batch API | 50% cheaper; see OpenAI Batch vs standard API |
| Agent making sequential tool-dependent calls | Sync (or awaited async) | Step k+1 needs step k's output, so there is no concurrency to extract |
| Fan-out: same prompt to multiple models, or map over chunks | Async | Classic parallel shape |

Mixing sync and async (the common bugs)

  • Sync call inside an async web handler — blocks the entire event loop; every other request on the server stalls for the duration. This is the #1 production mistake. Use the async client, or wrap legacy sync code in asyncio.to_thread().
  • asyncio.run() inside a running loop (Jupyter, FastAPI) — raises RuntimeError. In notebooks just await main() directly.
  • Threads as an alternative: ThreadPoolExecutor with the sync client also gets you concurrency and is a reasonable bridge in legacy codebases — but threads cost more memory and the pool size caps you. For LLM-scale I/O concurrency, asyncio is the cleaner fit.
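The asyncio.to_thread() bridge from the first bullet can be demonstrated without any network calls. In this sketch `legacy_sync_call` is a hypothetical blocking function standing in for a sync SDK call, and `heartbeat` stands in for the other requests a server would be handling; it can only tick while the event loop is free.

```python
import asyncio
import time

def legacy_sync_call(prompt: str) -> str:
    """Hypothetical blocking function standing in for a sync SDK call."""
    time.sleep(0.1)  # simulates the network wait
    return f"summary of {prompt}"

async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Stands in for other requests; it only ticks if the loop is not blocked."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)

async def main():
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    # asyncio.to_thread (Python 3.9+) runs each blocking call in a worker thread,
    # so the loop keeps servicing `heartbeat` while the calls wait.
    results = await asyncio.gather(
        *(asyncio.to_thread(legacy_sync_call, p) for p in ["a", "b", "c"])
    )
    stop.set()
    await beat
    return results, len(ticks)

results, tick_count = asyncio.run(main())
print(results, tick_count)
```

Calling `legacy_sync_call(p)` directly inside `main` would instead freeze the heartbeat for the full duration of every call, which is the stall described in the first bullet.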
Streaming is orthogonal

Sync vs async is about *how many requests overlap*; streaming is about *receiving one response incrementally*. They compose: an async handler can stream tokens to the browser while other requests proceed. For the delivery mechanics see streaming vs polling for LLMs and the SSE implementation guide.
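A toy illustration of how the two compose, with no real API involved: `fake_token_stream` is a hypothetical stand-in for an SDK streaming response, and two handlers consume their streams concurrently while a shared log records the order in which tokens arrive.

```python
import asyncio

async def fake_token_stream(text: str):
    """Hypothetical stand-in for a streaming response: yields one word at a time."""
    for token in text.split():
        await asyncio.sleep(0)  # yield to the event loop, as a network read would
        yield token

async def handle_request(name: str, text: str, log: list) -> str:
    """Consumes one stream incrementally; `log` records arrival order across requests."""
    received = []
    async for token in fake_token_stream(text):
        log.append(name)
        received.append(token)
    return " ".join(received)

async def main():
    log: list = []
    outputs = await asyncio.gather(
        handle_request("A", "one two three", log),
        handle_request("B", "four five six", log),
    )
    return outputs, log

outputs, log = asyncio.run(main())
print(outputs)  # each request receives its own complete text
print(log)      # tokens from A and B arrive interleaved
```

Each handler sees its own response token by token (streaming), while the event loop switches between the two handlers at every await (async).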

FAQ

Does async reduce cost? No: same tokens, same bill. It reduces wall-clock time and lets one server handle more concurrent users (which can reduce *infra* cost).

How high can I set concurrency? Up to your provider rate limit tier. Past ~50-100 concurrent requests, also watch connection-pool limits in the HTTP client.

Asyncio vs multiprocessing? LLM calls are I/O-bound, not CPU-bound, so multiprocessing buys nothing and costs serialization overhead. Reserve processes for genuinely CPU-heavy pre/post-processing.

*Last updated: June 2026.*