
Streaming AI Responses with Server-Sent Events: Complete Developer Guide 2026

Stream AI responses over Server-Sent Events, with practical examples and production patterns.


Streaming makes an AI app feel fast: instead of waiting for the full answer, you show tokens as they're generated. Server-Sent Events (SSE) is the simplest transport for this — a one-way stream from server to browser over plain HTTP, perfect for LLM token streams. This guide shows the full path from model to browser.

Why SSE (not WebSockets)

LLM streaming is one-directional (server → client), so you don't need WebSockets' bidirectional complexity. SSE is just an HTTP response with Content-Type: text/event-stream that stays open and emits data: lines. Browsers reconnect automatically via the EventSource API.
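To make the wire format concrete, here is a minimal sketch of how a single event is framed inside a text/event-stream body. The helper name `format_sse` is hypothetical and not part of any library; the framing rules (an optional `event:` line, one `data:` line per payload line, a blank line to end the event) follow the SSE format.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialize one event in text/event-stream framing (hypothetical helper)."""
    lines = []
    if event is not None:
        # Named events reach addEventListener(name) instead of onmessage.
        lines.append(f"event: {event}")
    # A payload containing newlines is sent as one `data:` line per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(repr(format_sse("Hello")))
    print(repr(format_sse("[DONE]", event="done")))
```

In practice a library such as sse-starlette does this framing for you; the sketch only shows what ends up on the wire.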

Server (FastAPI)

python
# Install first: pip install fastapi openai sse-starlette uvicorn
from fastapi import FastAPI
from openai import AsyncOpenAI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()
client = AsyncOpenAI()  # async client, so streaming does not block the event loop


@app.get("/chat")
async def chat(q: str):
    async def gen():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        async for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip them.
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content
            if delta:
                yield {"data": delta}
        # Named event that tells the client the answer is complete.
        yield {"event": "done", "data": "[DONE]"}

    return EventSourceResponse(gen())

Client (browser)

js
const es = new EventSource('/chat?q=' + encodeURIComponent(input));
es.onmessage = (e) => { output.textContent += e.data; };
es.addEventListener('done', () => es.close());

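That's the whole loop: each token arrives as a message event and is appended to the UI. The done event matters too: closing the EventSource on it stops the browser from reconnecting when the server ends the response, which would re-send the request and re-run the prompt.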
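When testing the endpoint outside a browser, it can help to see roughly what EventSource does with the incoming lines. The following is a simplified Python sketch of the dispatch rules, not a full implementation: it ignores the `id:` and `retry:` fields, and `parse_sse` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from text/event-stream lines (simplified)."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # exactly one leading space is stripped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


if __name__ == "__main__":
    wire = ["data: Hel", "", "data: lo", "", "event: done", "data: [DONE]", ""]
    for event_type, payload in parse_sse(wire):
        print(event_type, repr(payload))
```

Events without an `event:` line default to the type `message`, which is why the browser code handles tokens in `onmessage` and the final event in a separate `done` listener.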

Production notes

  • Disable proxy buffering (e.g. send the X-Accel-Buffering: no response header, which Nginx honors), or chunks get held back and the stream stutters or arrives all at once.
  • Flush per token. Any buffering layer between the model and the client defeats streaming.
  • Handle disconnects. If the user navigates away, cancel the upstream model call to stop billing.
  • Next.js note: the Vercel AI SDK wraps all of this — streamText + useChat give you streaming without hand-writing SSE. See Vercel AI SDK vs LangChain.js.
  • Backpressure / fan-out: for many concurrent streams, serve the model behind an inference engine built for high concurrency. See LLM Inference Optimization.
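The first two notes can be sketched at the ASGI level, the interface FastAPI runs on. This is a minimal illustration under stated assumptions, not how sse-starlette is implemented: `sse_app`, `fake_tokens`, and `collect` are hypothetical names, and the fake token generator stands in for the model. The app sends the anti-buffering header and then one body message per token, so nothing accumulates in the application itself. Whether each message reaches the client immediately still depends on the server and on any proxies in between.

```python
import asyncio

SSE_HEADERS = [
    (b"content-type", b"text/event-stream"),
    (b"cache-control", b"no-cache"),
    (b"x-accel-buffering", b"no"),  # asks Nginx not to buffer this response
]


async def fake_tokens():
    """Hypothetical stand-in for a model token stream."""
    for token in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)
        yield token


async def sse_app(scope, receive, send):
    """Bare ASGI app that emits one body message per token."""
    await send({"type": "http.response.start", "status": 200, "headers": SSE_HEADERS})
    async for token in fake_tokens():
        # Assumes the token has no newline; multi-line data needs one data: line per line.
        frame = f"data: {token}\n\n".encode("utf-8")
        # more_body=True keeps the response open for further frames.
        await send({"type": "http.response.body", "body": frame, "more_body": True})
    final = b"event: done\ndata: [DONE]\n\n"
    await send({"type": "http.response.body", "body": final, "more_body": False})


async def collect():
    """Drive the app with a fake `send` to inspect the messages it emits."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    await sse_app({"type": "http", "method": "GET", "path": "/chat"}, receive, send)
    return sent


if __name__ == "__main__":
    for message in asyncio.run(collect()):
        print(message)
```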
FAQ

SSE or WebSockets? SSE for one-way token streaming: it is simpler and reconnects automatically. Use WebSockets only if you need bidirectional realtime communication.

Why does my stream arrive all at once? A proxy is buffering the response. Disable buffering and flush per chunk.

How do I stop billing on disconnect? Detect the client disconnect and cancel the upstream completion.
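One way to cancel on disconnect is to close the upstream stream in a `finally` block inside the generator. The sketch below assumes the server framework closes or cancels the generator when the client goes away, and that the upstream stream object exposes an async `close()` method; check both against your framework and SDK versions. `FakeStream`, `relay`, and `simulate_disconnect` are hypothetical names used so the example runs without a model or a network.

```python
import asyncio


class FakeStream:
    """Hypothetical stand-in for an upstream model stream (async-iterable, closable)."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self.closed = False

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.closed or not self._tokens:
            raise StopAsyncIteration
        await asyncio.sleep(0)  # yield control, like a real network read
        return self._tokens.pop(0)

    async def close(self):
        self.closed = True


async def relay(stream):
    """Forward tokens as SSE payload dicts and always close the upstream stream."""
    try:
        async for token in stream:
            yield {"data": token}
        yield {"event": "done", "data": "[DONE]"}
    finally:
        # Runs on normal completion, on cancellation, and when the generator is closed early.
        await stream.close()


async def simulate_disconnect():
    upstream = FakeStream(["Hel", "lo", " world"])
    events = relay(upstream)
    first = await events.__anext__()  # the client receives one token...
    await events.aclose()             # ...then goes away
    return first, upstream.closed


if __name__ == "__main__":
    print(asyncio.run(simulate_disconnect()))
```

Putting the cleanup in `finally` keeps the happy path and the disconnect path in one place, so the upstream call is released however the generator ends.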

Summary

SSE is the path of least resistance for LLM streaming: open a text/event-stream, yield tokens as they arrive, append them in the browser with EventSource. Disable buffering, flush per token, and cancel on disconnect. On Next.js, let the Vercel AI SDK handle it.

Last updated: June 2026. Verify streaming APIs against the OpenAI and framework docs.
