AI Recipe: Stream OpenAI responses with FastAPI
A copy-paste recipe: a FastAPI endpoint that streams LLM tokens to the browser over SSE, with the production details (flush behavior, disconnects, proxy buffering) already handled. Ten minutes start to finish.
The server
python
# main.py
import json

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


class ChatIn(BaseModel):
    prompt: str


@app.post('/chat/stream')
async def chat_stream(body: ChatIn, request: Request):
    async def gen():
        stream = await client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': body.prompt}],
            stream=True,
        )
        async for chunk in stream:
            # Client gone? Stop paying for tokens.
            if await request.is_disconnected():
                break
            # Some endpoints send chunks with an empty choices list; skip them.
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content or ''
            if token:
                # One SSE event per token: JSON payload plus a blank line.
                yield f'data: {json.dumps({"token": token})}\n\n'
        yield 'data: [DONE]\n\n'

    return StreamingResponse(
        gen(),
        media_type='text/event-stream',
        headers={
            'Cache-Control': 'no-cache',
            'X-Accel-Buffering': 'no',  # tell nginx not to buffer
        },
    )
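To see what the generator above puts on the wire, here is a minimal sketch of the framing in isolation. The helper name `sse_event` is hypothetical (it is not part of FastAPI or the OpenAI SDK); it only wraps the same `json.dumps` plus blank-line pattern used in `gen()`.

```python
import json


def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a single SSE event (hypothetical helper)."""
    # json.dumps escapes newlines inside strings, so the payload
    # always stays on a single "data:" line.
    return f"data: {json.dumps(payload)}\n\n"


# A token that itself contains a newline.
frame = sse_event({"token": "line one\nline two"})

# The only real newlines in the frame are the two that end the event.
assert frame.count("\n") == 2
assert frame.endswith("\n\n")
# The receiver gets the original token back, newline included.
assert json.loads(frame[len("data: "):])["token"] == "line one\nline two"
```

Sent raw, that token would put a line break in the middle of the event; encoded, the line break travels as an escape sequence inside the JSON string.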
Run: uvicorn main:app --reload. The pieces that matter:
async def + AsyncOpenAI: a sync client here would block the event loop and stall every other request on the server (the classic mistake; see sync vs async LLM calls).
request.is_disconnected(): without it, a user closing the tab leaves the generation running to completion on your bill.
data: ...\n\n framing: the blank line terminates each SSE event; forget it and the client never sees a complete event.
JSON-encoded tokens: raw tokens can contain newlines, which would break the framing; json.dumps makes them safe.
The client
The browser's native EventSource only issues GET requests, so it cannot send a POST body; read the stream with fetch instead:
javascript
async function chat(prompt, onToken) {
  const resp = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = resp.body.getReader();
  const decoder = new TextDecoder();
  let buf = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    const events = buf.split('\n\n');
    buf = events.pop(); // keep incomplete tail
    for (const ev of events) {
      const data = ev.replace(/^data: /, '');
      if (data === '[DONE]') return;
      onToken(JSON.parse(data).token);
    }
  }
}

chat('Explain SSE in two sentences', t => { output.textContent += t; });
The buffer-split dance matters: network chunks don't align with SSE event boundaries, so always split on \n\n and keep the remainder.
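The same split-and-keep-the-tail logic can be sketched in Python, which is handy for a test script or a non-browser client. This is an illustration under assumptions, not part of the recipe: `split_events` and `collect_tokens` are hypothetical names, and the chunks are hand-made strings cut mid-event on purpose rather than real network reads.

```python
import json


def split_events(buf: str, chunk: str) -> tuple[list[str], str]:
    """Append a chunk; return (complete events, unfinished tail)."""
    buf += chunk
    # Everything before the last "\n\n" is complete; the rest is the tail.
    *events, tail = buf.split("\n\n")
    return events, tail


def collect_tokens(chunks: list[str]) -> list[str]:
    """Feed chunks through the buffer and gather tokens until [DONE]."""
    tokens: list[str] = []
    buf = ""
    for chunk in chunks:
        events, buf = split_events(buf, chunk)
        for ev in events:
            data = ev.removeprefix("data: ")  # Python 3.9+
            if data == "[DONE]":
                return tokens
            tokens.append(json.loads(data)["token"])
    return tokens


# Boundaries fall inside a JSON payload and between the two newlines.
chunks = [
    'data: {"tok',
    'en": "Hel"}\n\ndata: {"token"',
    ': "lo"}\n',
    '\ndata: [DONE]\n\n',
]
print(collect_tokens(chunks))  # ['Hel', 'lo']
```

Parsing each chunk on its own would fail on the very first one, since `data: {"tok` is not valid JSON; buffering until a blank line arrives is what makes the parse safe.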
Production checklist
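For the try/except item in this checklist, a minimal sketch of the idea: wrap the stream so a failure becomes one final `error` event rather than a silently dropped connection. The names `with_error_event` and `flaky_source` are hypothetical, and `flaky_source` is only a stand-in for the token loop so the example runs without an API key.

```python
import asyncio
import json
from typing import AsyncIterator


async def with_error_event(source: AsyncIterator[str]) -> AsyncIterator[str]:
    """Relay SSE frames; if the source fails, emit one error event and stop."""
    try:
        async for frame in source:
            yield frame
    except Exception as e:
        # asyncio.CancelledError is a BaseException, so a cancelled
        # request is not swallowed here.
        payload = json.dumps({"error": str(e)})
        yield f"data: {payload}\n\n"


async def flaky_source() -> AsyncIterator[str]:
    """Stand-in for the token loop: one good frame, then a failure."""
    yield 'data: {"token": "Hi"}\n\n'
    raise RuntimeError("upstream timeout")


async def collect() -> list[str]:
    return [frame async for frame in with_error_event(flaky_source())]


frames = asyncio.run(collect())
for frame in frames:
    print(repr(frame))
```

On the client, this means checking each parsed payload for an `error` key before reading `token`.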
Proxy buffering: with nginx in front, the X-Accel-Buffering: no header handles it per-response; otherwise set proxy_buffering off; for the route.
Mid-stream errors: yield f'data: {json.dumps({"error": str(e)})}\n\n' inside a try/except around the loop, and handle it client-side.
Variations
The same pattern carries over to Anthropic's SDK (async with client.messages.stream(...)), and any OpenAI-compatible endpoint, including local Ollama, works with just a base_url change.
Last updated: June 2026.