Synchronous vs Async LLM Calls: Side-by-Side Comparison
Performance comparison for concurrent LLM operations — comparing throughput across asyncio and httpx
The rule in one line: one call at a time → synchronous is fine; many calls in flight → async wins, enormously. An LLM request is seconds of pure I/O wait. Sync code burns those seconds doing nothing; async code uses them to run other requests. With 100 calls at ~4s each, sequential sync takes ~400s, async with concurrency 10 takes ~40s. Same API bill, 10× less wall-clock.
What this is *not* about: making a single call faster. Async does nothing for one call's latency — the model takes as long as it takes.
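The 100-call arithmetic above can be checked without an API key. The sketch below is a simulation only: `fake_call_sync`, `fake_call_async`, `run_sync`, and `run_async` are hypothetical helpers, and the ~4 s model latency is replaced by a 0.02 s sleep so the script finishes quickly.

```python
import asyncio
import time

LATENCY_S = 0.02   # stand-in for ~4 s of model latency, scaled down 200x
N_CALLS = 100
CONCURRENCY = 10


def fake_call_sync(latency_s: float) -> None:
    # Blocks the whole thread, like a synchronous HTTP request.
    time.sleep(latency_s)


async def fake_call_async(sem: asyncio.Semaphore, latency_s: float) -> None:
    async with sem:
        # Yields to the event loop while "waiting on the network".
        await asyncio.sleep(latency_s)


def run_sync(n_calls: int = N_CALLS, latency_s: float = LATENCY_S) -> float:
    """Run the fake calls one after another and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n_calls):
        fake_call_sync(latency_s)
    return time.perf_counter() - start


async def run_async(
    n_calls: int = N_CALLS,
    latency_s: float = LATENCY_S,
    concurrency: int = CONCURRENCY,
) -> float:
    """Run the fake calls with a concurrency cap and return elapsed seconds."""
    sem = asyncio.Semaphore(concurrency)
    start = time.perf_counter()
    await asyncio.gather(*(fake_call_async(sem, latency_s) for _ in range(n_calls)))
    return time.perf_counter() - start


if __name__ == "__main__":
    sync_s = run_sync()
    async_s = asyncio.run(run_async())
    print(f"sync:  {sync_s:.2f}s")
    print(f"async: {async_s:.2f}s (~{sync_s / async_s:.1f}x less wall-clock)")
```

The ratio should land near the concurrency cap, though timer and scheduling overhead will shift it somewhat. Real API latency varies per request, so treat the numbers as an illustration of the shape, not a benchmark.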
The same task, both ways
```python
# Synchronous: fine for scripts, CLIs, one-at-a-time flows
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text: str) -> str:
    resp = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'Summarize in one line: {text}'}]
    )
    return resp.choices[0].message.content

# `documents` is a list of input strings, assumed to be defined elsewhere
results = [summarize(doc) for doc in documents]  # N * latency, sequential
```
```python
# Async: same work, concurrent
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
sem = asyncio.Semaphore(10)  # concurrency cap, see below

async def summarize(text: str) -> str:
    async with sem:  # at most 10 requests in flight at once
        resp = await client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': f'Summarize in one line: {text}'}]
        )
    return resp.choices[0].message.content

async def main():
    # `documents` is a list of input strings, assumed to be defined elsewhere
    return await asyncio.gather(*(summarize(d) for d in documents))

results = asyncio.run(main())  # ~N/10 * latency
```
Every major SDK ships both clients (OpenAI/AsyncOpenAI, Anthropic/AsyncAnthropic); in Node.js everything is async by default, so this decision mostly exists in Python.
The semaphore is not optional
asyncio.gather over 5,000 coroutines fires 5,000 simultaneous requests — you'll hit 429 rate limits instantly and possibly trip abuse detection. The semaphore caps in-flight requests. Size it from your rate limit: roughly (requests-per-minute limit / 60) × average request seconds, then back off from there. Add retry-with-backoff for the 429s that still slip through (the SDKs retry automatically; tune max_retries).
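The sizing rule and the retry advice above can be sketched as follows. This is a minimal illustration under stated assumptions: `concurrency_cap`, `with_backoff`, and the `RateLimited` exception are hypothetical names (a real SDK raises its own rate-limit error type), and the `headroom` factor of 0.8 is an arbitrary starting point for "back off from there".

```python
import asyncio
import math
import random


def concurrency_cap(rpm_limit: int, avg_request_s: float, headroom: float = 0.8) -> int:
    """Estimate in-flight requests: (requests per second) x (seconds per request).

    `headroom` (an assumption, not a provider setting) shaves the estimate
    so bursts do not sit exactly on the limit.
    """
    per_second = rpm_limit / 60
    return max(1, math.floor(per_second * avg_request_s * headroom))


class RateLimited(Exception):
    """Hypothetical stand-in for an SDK's 429 error."""


async def with_backoff(make_call, max_retries: int = 5,
                       base_delay_s: float = 0.5, max_delay_s: float = 30.0):
    """Await make_call(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter: sleep a random amount up to the current delay
            # so many coroutines do not retry in lockstep.
            await asyncio.sleep(random.uniform(0, delay))


if __name__ == "__main__":
    # 500 requests/minute at ~4 s each, with 20% headroom
    print(concurrency_cap(500, 4.0))  # 26
```

The result of `concurrency_cap` would feed `asyncio.Semaphore(...)`. When the SDK's built-in retries are enough, passing `max_retries` to the client constructor is simpler than a hand-written loop like this one.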
Decision table
async def routes + async client
Mixing sync and async (the common bugs)
asyncio.to_thread().
asyncio.run() inside a running loop (Jupyter, FastAPI) raises RuntimeError. In notebooks, just await main() directly.
ThreadPoolExecutor with the sync client also gets you concurrency and is a reasonable bridge in legacy codebases, but threads cost more memory and the pool size caps you. For LLM-scale I/O concurrency, asyncio is the cleaner fit.
Streaming is orthogonal
Sync vs async is about *how many requests overlap*; streaming is about *receiving one response incrementally*. They compose — an async handler can stream tokens to the browser while other requests proceed. For the delivery mechanics see streaming vs polling for LLMs and the SSE implementation guide.
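To see the two ideas compose, here is a small simulation with no network involved. `fake_stream` and `consume` are hypothetical helpers: an async generator stands in for a streamed response, and two of them run under `asyncio.gather` at once.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(name: str, tokens: List[str], delay_s: float = 0.01) -> AsyncIterator[str]:
    """Yield tokens one at a time, pausing as if waiting on the next chunk."""
    for tok in tokens:
        await asyncio.sleep(delay_s)
        yield f"{name}:{tok}"


async def consume(name: str, tokens: List[str], log: List[str]) -> None:
    # Streaming: handle one response incrementally, chunk by chunk.
    async for chunk in fake_stream(name, tokens):
        log.append(chunk)


async def main() -> List[str]:
    log: List[str] = []
    # Concurrency: two streamed responses overlap on one event loop.
    await asyncio.gather(
        consume("a", ["x", "y", "z"], log),
        consume("b", ["1", "2", "3"], log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Chunks from the two streams arrive interleaved in `log`, while each stream keeps its own order. In a real handler the `log.append` step would be replaced by whatever forwards the chunk to the client.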
FAQ
Does async reduce cost? No — same tokens, same bill. It reduces wall-clock and lets one server handle more concurrent users (which can reduce *infra* cost).
How high can I set concurrency? Up to your provider rate limit tier. Past ~50-100 concurrent requests, also watch connection-pool limits in the HTTP client.
Asyncio vs multiprocessing? LLM calls are I/O-bound, not CPU-bound — multiprocessing buys nothing and costs serialization overhead. Reserve processes for genuinely CPU-heavy pre/post-processing.
*Last updated: June 2026.*
Also available in 中文.