AI Recipe: Stream OpenAI responses with FastAPI

A copy-paste recipe: a FastAPI endpoint that streams LLM tokens to the browser over Server-Sent Events (SSE), with the production details (flush behavior, disconnects, proxy buffering) already handled. Ten minutes start to finish.

The server

python
# main.py
import json

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY


class ChatIn(BaseModel):
    prompt: str


@app.post('/chat/stream')
async def chat_stream(body: ChatIn, request: Request):
    async def gen():
        stream = await client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': body.prompt}],
            stream=True,
        )
        async for chunk in stream:
            # Client gone? Stop paying for tokens.
            if await request.is_disconnected():
                break
            # Some providers and options send chunks with no choices; skip them.
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content or ''
            if token:
                yield f'data: {json.dumps({"token": token})}\n\n'
        yield 'data: [DONE]\n\n'

    return StreamingResponse(
        gen(),
        media_type='text/event-stream',
        headers={
            'Cache-Control': 'no-cache',
            'X-Accel-Buffering': 'no',   # tell nginx not to buffer
        },
    )

Run: uvicorn main:app --reload. The pieces that matter:

async def + AsyncOpenAI — a sync client here would block the event loop and stall every other request on the server (the classic mistake; see sync vs async LLM calls).

request.is_disconnected() — without it, a user closing the tab leaves the generation running to completion on your bill.

data: ...\n\n framing — that blank line terminates each SSE event; forget it and the browser receives nothing.

JSON-wrap each token — raw tokens with newlines break SSE framing; json.dumps makes them safe.
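
To make the last two points concrete, here is a minimal sketch of the framing step in isolation. The helper name `sse_event` is hypothetical (the recipe inlines the f-string instead). It only illustrates why `json.dumps` keeps a token that contains a newline on a single `data:` line.

```python
import json


def sse_event(payload: dict) -> str:
    """Frame one payload as a single SSE event (hypothetical helper)."""
    # json.dumps escapes a newline as the two characters "\" and "n",
    # so the payload always fits on one "data:" line.
    return f"data: {json.dumps(payload)}\n\n"


raw_token = "line one\nline two"      # a token containing a real newline
framed = sse_event({"token": raw_token})

# Exactly one event: the only blank line is the terminator at the end.
assert framed.count("\n\n") == 1 and framed.endswith("\n\n")
# The receiver recovers the original token, newline included.
assert json.loads(framed[len("data: "):].strip())["token"] == raw_token
```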

The client

The browser's native EventSource can only issue GET requests, so it cannot send a POST body. Read the stream with fetch instead:

javascript
async function chat(prompt, onToken) {
  const resp = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = resp.body.getReader();
  const decoder = new TextDecoder();
  let buf = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    const events = buf.split('\n\n');
    buf = events.pop();                       // keep incomplete tail
    for (const ev of events) {
      const data = ev.replace(/^data: /, '');
      if (data === '[DONE]') return;
      onToken(JSON.parse(data).token);
    }
  }
}

chat('Explain SSE in two sentences', t => { output.textContent += t; });

The buffer-split dance matters: network chunks don't align with SSE event boundaries, so always split on \n\n and keep the remainder.
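
The same buffering logic can be sketched in Python, for instance for a test script that consumes the endpoint. The class name `SSEBuffer` and its `feed` method are illustrative and not part of any library. The sketch assumes each event carries a single `data:` line, as the server above emits, and Python 3.9+ for `str.removeprefix`.

```python
import json


class SSEBuffer:
    """Accumulate raw text chunks and return complete SSE data payloads."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, chunk: str) -> list[str]:
        self._buf += chunk
        # Everything before the last "\n\n" is complete; keep the tail for later.
        *events, self._buf = self._buf.split("\n\n")
        return [ev.removeprefix("data: ") for ev in events if ev]


parser = SSEBuffer()
tokens = []
# Chunk boundaries deliberately fall in the middle of events.
chunks = ['data: {"tok', 'en": "Hel"}\n\ndata: {"token"', ': "lo"}\n', '\ndata: [DONE]\n\n']
for chunk in chunks:
    for data in parser.feed(chunk):
        if data != "[DONE]":
            tokens.append(json.loads(data)["token"])

assert "".join(tokens) == "Hello"
```

Splitting each chunk on its own, without the carried-over tail, would hand `json.loads` a fragment such as `{"tok` and fail on the very first chunk.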

Production checklist

nginx: the X-Accel-Buffering: no header handles it per-response; otherwise set proxy_buffering off; for the route.

Serverless/PaaS: confirm your platform supports streaming responses — some buffer the whole body by default.

Errors mid-stream: HTTP status is already sent, so emit an error event instead — yield f'data: {json.dumps({"error": str(e)})}\n\n' inside a try/except around the loop, and handle it client-side.

Auth: it's a normal POST — your existing FastAPI auth dependencies apply unchanged.
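
The mid-stream error item can be sketched as a wrapper around the token loop. This is a minimal, framework-free illustration under stated assumptions: `sse_with_errors` and `flaky_tokens` are hypothetical names, and the fake source stands in for the OpenAI stream so the snippet runs without an API key.

```python
import asyncio
import json
from typing import AsyncIterator


async def sse_with_errors(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Frame tokens as SSE events; report a failure as an in-band error event."""
    try:
        async for token in tokens:
            yield f"data: {json.dumps({'token': token})}\n\n"
    except Exception as e:
        # The HTTP status line has already gone out, so the error travels in-band.
        yield f"data: {json.dumps({'error': str(e)})}\n\n"
    yield "data: [DONE]\n\n"


async def flaky_tokens() -> AsyncIterator[str]:
    """Stand-in for the upstream stream: one token, then a failure."""
    yield "Hi"
    raise RuntimeError("upstream closed")


async def collect() -> list[str]:
    return [event async for event in sse_with_errors(flaky_tokens())]


events = asyncio.run(collect())
assert events == [
    'data: {"token": "Hi"}\n\n',
    'data: {"error": "upstream closed"}\n\n',
    'data: [DONE]\n\n',
]
```

In the endpoint, the same try/except would wrap the `async for chunk in stream` loop inside `gen()`. On the client, check for an `error` key before reading `token`. Depending on your threat model, you may prefer a generic message over `str(e)`, which can expose internal details.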

Variations

Any provider: Anthropic's SDK streams the same way (async with client.messages.stream(...)), and any OpenAI-compatible endpoint — including local Ollama — works with just a base_url change.

Next.js instead of FastAPI: the Vercel AI SDK wraps this whole recipe — see Vercel AI SDK vs LangChain.js.

Why SSE and not WebSockets: one-directional token flow doesn't need a socket; deeper rationale in the SSE guide and streaming vs polling.

*Last updated: June 2026.*
