
gRPC AI Services: Complete Integration Guide

High-performance AI services with gRPC protocol buffers

gRPC earns its place in an AI stack in exactly one situation: service-to-service inference traffic inside your own infrastructure — a recommendation service calling your model server ten thousand times a minute, a pipeline of ML microservices passing tensors around. For browser-facing AI chat, plain HTTP + SSE remains the right call (browsers can't speak native gRPC). This guide builds a streaming LLM inference service in gRPC and is honest about where the wins are.

Why gRPC for internal AI traffic

  • Server streaming is first-class — token streaming maps directly onto a stream response, no SSE framing hacks.
  • Protobuf contracts — request/response shapes are compiler-checked in every language touching the service; no drifting JSON.
  • HTTP/2 multiplexing + binary encoding — meaningfully lower overhead than JSON-over-HTTP/1.1 when calls are small and frequent (per-call savings are milliseconds; they matter at thousands of QPS, not at 10).
  • Deadlines built in: a timeout propagates through call chains, which is exactly what you want when wrapping a model that can hang.
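To make deadline propagation concrete, here is a minimal sketch of how a facade might forward the time it has left to the next hop. In a gRPC Python handler the remaining time comes from `context.time_remaining()`. The helper name `downstream_timeout`, the 50 ms reserve, and the 30 s fallback are illustrative assumptions, not part of gRPC.

```python
from typing import Optional


def downstream_timeout(
    time_remaining: Optional[float],
    reserve: float = 0.05,
    default: float = 30.0,
) -> float:
    """Return the timeout (seconds) to pass to the next hop.

    time_remaining: value of context.time_remaining() in a gRPC handler,
                    or None when the caller set no deadline.
    reserve:        time kept back so this service can still send its own
                    response or error (hypothetical value).
    default:        budget used when the caller set no deadline (hypothetical).
    """
    if time_remaining is None:
        return default
    # Never return a negative timeout; 0.0 means "already out of time".
    return max(time_remaining - reserve, 0.0)


# Inside a servicer method this would be used roughly as:
#   budget = downstream_timeout(context.time_remaining())
#   reply = await other_stub.SomeCall(request, timeout=budget)
print(downstream_timeout(2.0))   # caller left 2 s, forward slightly less
print(downstream_timeout(None))  # no deadline set, fall back to the default
```

Each hop shrinking the budget a little means the innermost service times out first, so errors surface where the slowness actually is.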
The contract

    protobuf
    // llm.proto
    syntax = "proto3";
    package llm.v1;

    service LLMService {
      // Unary: classification, embeddings, short completions
      rpc Complete(CompleteRequest) returns (CompleteResponse);
      // Server streaming: token-by-token generation
      rpc CompleteStream(CompleteRequest) returns (stream TokenChunk);
    }

    message CompleteRequest {
      string model = 1;
      repeated Message messages = 2;
      int32 max_tokens = 3;
    }

    message Message {
      string role = 1;     // "user" | "assistant" | "system"
      string content = 2;
    }

    message CompleteResponse {
      string text = 1;
      Usage usage = 2;
    }

    message TokenChunk {
      string token = 1;
      bool done = 2;
      Usage usage = 3;     // populated on the final chunk
    }

    message Usage {
      int32 input_tokens = 1;
      int32 output_tokens = 2;
    }

    Generate stubs with python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. llm.proto.

    The server — wrapping a real backend

    A common production shape: the gRPC service is a *facade* over vLLM, Ollama, or a cloud provider, adding your auth, logging, and routing. Async implementation:

    python
    import asyncio

    import grpc.aio
    from openai import AsyncOpenAI

    import llm_pb2
    import llm_pb2_grpc

    # vLLM's OpenAI-compatible endpoint
    upstream = AsyncOpenAI(base_url='http://vllm:8000/v1', api_key='-')


    class LLMService(llm_pb2_grpc.LLMServiceServicer):
        async def CompleteStream(self, request, context):
            stream = await upstream.chat.completions.create(
                model=request.model,
                messages=[{'role': m.role, 'content': m.content} for m in request.messages],
                max_tokens=request.max_tokens or 1024,  # proto3 default 0 means "not set"
                stream=True,
            )
            async for chunk in stream:
                if not chunk.choices:  # skip chunks that carry no choices
                    continue
                token = chunk.choices[0].delta.content or ''
                if token:
                    yield llm_pb2.TokenChunk(token=token, done=False)
            # Final marker so clients can tell a clean end from a dropped stream
            yield llm_pb2.TokenChunk(token='', done=True)


    async def serve():
        server = grpc.aio.server()
        llm_pb2_grpc.add_LLMServiceServicer_to_server(LLMService(), server)
        server.add_insecure_port('[::]:50051')  # mTLS in production
        await server.start()
        await server.wait_for_termination()


    if __name__ == '__main__':
        asyncio.run(serve())

    (Serving the model itself is its own topic — see vLLM and TensorRT-LLM inference optimization.)

    The client — deadlines are the point

    python
    import asyncio

    import grpc.aio

    import llm_pb2
    import llm_pb2_grpc


    async def main():
        async with grpc.aio.insecure_channel('llm-service:50051') as channel:
            stub = llm_pb2_grpc.LLMServiceStub(channel)
            call = stub.CompleteStream(
                llm_pb2.CompleteRequest(
                    model='qwen2.5-coder:32b',
                    messages=[llm_pb2.Message(role='user', content='Explain HTTP/2 multiplexing')],
                ),
                timeout=60.0,  # hard deadline; propagates to the server as the gRPC deadline
            )
            async for chunk in call:
                print(chunk.token, end='', flush=True)


    asyncio.run(main())
    

    If the deadline expires mid-generation the client gets DEADLINE_EXCEEDED and — crucially — the server sees the cancellation and can stop the (expensive) upstream generation. Check context.cancelled() in long server loops and cancel the upstream call; otherwise you keep paying for tokens nobody will receive.
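The cancellation check described above can be sketched as a small relay generator. This is a minimal illustration under stated assumptions: `relay_tokens` is a hypothetical helper, `upstream` is assumed to be an async iterator of token strings with an async `close()` method (adapt this to whatever stream object your upstream client returns), and `context` only needs a `cancelled()` method like the one on the grpc.aio servicer context. The fake classes exist only so the sketch runs without a server.

```python
import asyncio


async def relay_tokens(upstream, context):
    """Yield tokens from `upstream` until it ends or the RPC is cancelled."""
    try:
        async for token in upstream:
            if context.cancelled():
                break  # client gave up or deadline passed: stop relaying
            yield token
    finally:
        # Runs on normal completion, on break, and when the handler task is
        # cancelled, so the upstream generation is always released.
        await upstream.close()


# --- stand-ins so the sketch runs without gRPC or a model server ---------
class FakeContext:
    def __init__(self, cancel_after):
        self.calls = 0
        self.cancel_after = cancel_after

    def cancelled(self):
        self.calls += 1
        return self.calls > self.cancel_after


class FakeUpstream:
    def __init__(self, tokens):
        self._it = iter(tokens)
        self.closed = False

    def __aiter__(self):
        return self

    async def __anext__(self):
        await asyncio.sleep(0)  # pretend to wait on the network
        try:
            return next(self._it)
        except StopIteration:
            raise StopAsyncIteration from None

    async def close(self):
        self.closed = True


async def demo():
    upstream = FakeUpstream(["a", "b", "c", "d"])
    ctx = FakeContext(cancel_after=2)  # "cancelled" after two tokens
    out = [t async for t in relay_tokens(upstream, ctx)]
    return out, upstream.closed


result = asyncio.run(demo())
print(result)
```

Putting the cleanup in `finally` rather than only after the `cancelled()` check matters because an asyncio handler can also be interrupted at an `await`, in which case the loop body never gets a chance to run the check.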

    Production notes

  • Load balancing: gRPC's long-lived HTTP/2 connections defeat naive L4 balancers — all traffic sticks to one backend. Use an L7 proxy (Envoy, Linkerd) or client-side load balancing.
  • Browsers: need grpc-web + a proxy. At that point, for a chat UI, SSE is less machinery — see SSE streaming guide; for the general transport decision, streaming vs polling.
  • Message size: default max is 4MB — raise max_receive_message_length if you pass embeddings or images.
  • Retries: make unary endpoints idempotent and configure retry policy in the service config, not ad hoc in callers.
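Several of the notes above come down to channel configuration. The sketch below builds the options for a Python client: larger message limits, client-side round-robin balancing, and a retry policy declared in the service config. The helper name `build_channel_options`, the 32 MB default, the backoff numbers, and the `dns:///llm-service:50051` target are illustrative assumptions; tune them for your deployment.

```python
import json


def build_channel_options(max_message_mb: int = 32):
    """Return (target, options) to pass to grpc.aio.insecure_channel/secure_channel."""
    service_config = {
        # Client-side load balancing across every address the resolver returns.
        "loadBalancingConfig": [{"round_robin": {}}],
        "methodConfig": [
            {
                # Retry only the unary method; the streaming one is left alone.
                "name": [{"service": "llm.v1.LLMService", "method": "Complete"}],
                "retryPolicy": {
                    "maxAttempts": 3,
                    "initialBackoff": "0.2s",
                    "maxBackoff": "2s",
                    "backoffMultiplier": 2,
                    "retryableStatusCodes": ["UNAVAILABLE"],
                },
            }
        ],
    }
    size = max_message_mb * 1024 * 1024
    options = [
        ("grpc.max_receive_message_length", size),
        ("grpc.max_send_message_length", size),
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(service_config)),
    ]
    # The dns:/// scheme asks the client to resolve all backend addresses
    # (hypothetical hostname; e.g. a headless service in Kubernetes).
    return "dns:///llm-service:50051", options


target, options = build_channel_options()
# Usage: channel = grpc.aio.insecure_channel(target, options=options)
print(target)
for key, value in options:
    print(key, value)
```

Keeping the retry policy in the service config means every caller that uses this helper gets the same behavior, which is the point of not retrying ad hoc.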
    gRPC vs REST+SSE for AI, honestly

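    |                             | gRPC                     | REST + SSE                   |
    |-----------------------------|--------------------------|------------------------------|
    | Internal service-to-service | ✅ best fit              | works                        |
    | Browser clients             | ⚠️ grpc-web + proxy      | ✅ native                    |
    | Contract enforcement        | ✅ protobuf              | OpenAPI (weaker in practice) |
    | Streaming                   | ✅ native, bidirectional | server→client only           |
    | Debuggability               | grpcurl, reflection      | curl, every tool ever        |
    | Team familiarity            | lower                    | universal                    |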

    If you're not already running gRPC infrastructure, an LLM facade alone rarely justifies introducing it. If you are, the streaming + deadline semantics are a genuinely better fit than REST.

    FAQ

    Bidirectional streaming — when? Live conversational turns over one connection (voice agents, interactive sessions). Most request/response LLM traffic only needs server streaming.

    How do errors map? Use canonical codes: INVALID_ARGUMENT for bad prompts, RESOURCE_EXHAUSTED for rate limits (clients treat it as retryable-with-backoff), UNAVAILABLE for upstream outage.
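One way to apply that mapping in a facade is to translate the upstream's HTTP status into a canonical code name. The table below is an assumption about a typical OpenAI-compatible upstream, and `grpc_code_name` is a hypothetical helper; the three codes named in the answer above are the only ones taken from this guide.

```python
# HTTP status from the upstream (assumed OpenAI-compatible) -> gRPC code name.
_HTTP_TO_GRPC = {
    400: "INVALID_ARGUMENT",    # malformed or rejected prompt
    422: "INVALID_ARGUMENT",
    429: "RESOURCE_EXHAUSTED",  # rate limited; callers should back off
    502: "UNAVAILABLE",         # upstream outage
    503: "UNAVAILABLE",
}


def grpc_code_name(http_status: int) -> str:
    """Return the canonical gRPC status name for an upstream HTTP status."""
    return _HTTP_TO_GRPC.get(http_status, "UNKNOWN")


# In a grpc.aio handler the name can be turned into the enum and sent back:
#   code = grpc.StatusCode[grpc_code_name(status)]
#   await context.abort(code, "upstream error")
for status in (400, 429, 503, 418):
    print(status, grpc_code_name(status))
```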


    *Last updated: June 2026.*
