gRPC AI Services: Complete Integration Guide
High-performance AI services with gRPC protocol buffers
gRPC earns its place in an AI stack in exactly one situation: service-to-service inference traffic inside your own infrastructure — a recommendation service calling your model server ten thousand times a minute, a pipeline of ML microservices passing tensors around. For browser-facing AI chat, plain HTTP + SSE remains the right call (browsers can't speak native gRPC). This guide builds a streaming LLM inference service in gRPC and is honest about where the wins are.
Why gRPC for internal AI traffic
Streaming is a native stream response, no SSE framing hacks.
A timeout propagates through call chains, which is exactly what you want wrapping a model that can hang.
The contract
protobuf
// llm.proto
syntax = "proto3";

package llm.v1;

service LLMService {
  // Unary: classification, embeddings, short completions
  rpc Complete(CompleteRequest) returns (CompleteResponse);
  // Server streaming: token-by-token generation
  rpc CompleteStream(CompleteRequest) returns (stream TokenChunk);
}

message CompleteRequest {
  string model = 1;
  repeated Message messages = 2;
  int32 max_tokens = 3;
}

message Message {
  string role = 1;  // "user" | "assistant" | "system"
  string content = 2;
}

message CompleteResponse {
  string text = 1;
  Usage usage = 2;
}

message TokenChunk {
  string token = 1;
  bool done = 2;
  Usage usage = 3;  // populated on the final chunk
}

message Usage {
  int32 input_tokens = 1;
  int32 output_tokens = 2;
}
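Generate stubs with python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. llm.proto (grpc_tools ships in the grpcio-tools package). This writes llm_pb2.py and llm_pb2_grpc.py next to the proto file.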
The server — wrapping a real backend
A common production shape: the gRPC service is a *facade* over vLLM, Ollama, or a cloud provider, adding your auth, logging, and routing. Async implementation:
python
import asyncio

import grpc.aio
from openai import AsyncOpenAI

import llm_pb2
import llm_pb2_grpc

# vLLM's OpenAI-compatible endpoint
upstream = AsyncOpenAI(base_url='http://vllm:8000/v1', api_key='-')


class LLMService(llm_pb2_grpc.LLMServiceServicer):
    async def CompleteStream(self, request, context):
        stream = await upstream.chat.completions.create(
            model=request.model,
            messages=[{'role': m.role, 'content': m.content} for m in request.messages],
            max_tokens=request.max_tokens or 1024,  # proto3 default 0 means "unset"
            stream=True,
        )
        async for chunk in stream:
            # Guard the index: a chunk can arrive with an empty choices list.
            token = chunk.choices[0].delta.content if chunk.choices else None
            if token:
                yield llm_pb2.TokenChunk(token=token, done=False)
        # Final marker; usage is not populated in this minimal version.
        yield llm_pb2.TokenChunk(token='', done=True)


async def serve():
    server = grpc.aio.server()
    llm_pb2_grpc.add_LLMServiceServicer_to_server(LLMService(), server)
    server.add_insecure_port('[::]:50051')  # mTLS in production
    await server.start()
    await server.wait_for_termination()


if __name__ == '__main__':
    asyncio.run(serve())
(Serving the model itself is its own topic — see vLLM and TensorRT-LLM inference optimization.)
The client — deadlines are the point
python
import asyncio

import grpc.aio

import llm_pb2
import llm_pb2_grpc


async def main():
    async with grpc.aio.insecure_channel('llm-service:50051') as channel:
        stub = llm_pb2_grpc.LLMServiceStub(channel)
        call = stub.CompleteStream(
            llm_pb2.CompleteRequest(
                model='qwen2.5-coder:32b',
                messages=[llm_pb2.Message(role='user', content='Explain HTTP/2 multiplexing')],
            ),
            timeout=60.0,  # hard deadline, sent to the server as the gRPC deadline
        )
        async for chunk in call:
            print(chunk.token, end='', flush=True)


asyncio.run(main())
If the deadline expires mid-generation the client gets DEADLINE_EXCEEDED and — crucially — the server sees the cancellation and can stop the (expensive) upstream generation. Check context.cancelled() in long server loops and cancel the upstream call; otherwise you keep paying for tokens nobody will receive.
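The cancellation path is easier to see in isolation. The sketch below uses only asyncio, with no gRPC or model backend: `fake_upstream` and `relay` are hypothetical stand-ins for the upstream token stream and the body of a streaming handler. It assumes that cancellation reaches the handler as `asyncio.CancelledError`, which is how grpc.aio typically surfaces a cancelled RPC; treat that as an assumption to verify against your gRPC version.

```python
import asyncio

closed = []  # records that the upstream generator was shut down


async def fake_upstream():
    """Hypothetical stand-in for a streaming LLM backend."""
    try:
        for i in range(1000):
            await asyncio.sleep(0.01)
            yield f'tok{i} '
    finally:
        # Runs when the generator is torn down: the place to abort the real upstream request.
        closed.append(True)


async def relay(upstream, out):
    """Hypothetical handler body: forward tokens until cancelled, then close upstream."""
    try:
        async for token in upstream:
            out.append(token)
    finally:
        # Reached on normal completion and on asyncio.CancelledError alike.
        await upstream.aclose()


async def demo():
    out = []
    task = asyncio.create_task(relay(fake_upstream(), out))
    await asyncio.sleep(0.05)  # let a few tokens through
    task.cancel()              # simulates a client cancel or an expired deadline
    try:
        await task
    except asyncio.CancelledError:
        pass
    return out


received = asyncio.run(demo())
```

Only a handful of the 1000 tokens are ever produced, because the upstream generator is closed as soon as the relay task is cancelled. In a real facade the `finally` block is where you would close the upstream stream object so the backend stops generating.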
Production notes
Raise max_receive_message_length if you pass embeddings or images (gRPC's default receive limit is 4 MB).
gRPC vs REST+SSE for AI, honestly
If you're not already running gRPC infrastructure, an LLM facade alone rarely justifies introducing it. If you are, the streaming + deadline semantics are a genuinely better fit than REST.
FAQ
Bidirectional streaming — when? Live conversational turns over one connection (voice agents, interactive sessions). Most request/response LLM traffic only needs server streaming.
How do errors map? Use canonical codes: INVALID_ARGUMENT for bad prompts, RESOURCE_EXHAUSTED for rate limits (clients treat it as retryable-with-backoff), UNAVAILABLE for upstream outage.
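A minimal sketch of that mapping for a facade sitting in front of an HTTP backend. The HTTP status numbers, the `to_grpc_code` and `backoff_delays` helpers, and the backoff parameters are all assumptions for illustration, not part of any library. Codes are kept as plain strings so the snippet runs without grpcio; in a real service they would be the matching `grpc.StatusCode` members.

```python
import random

# Hypothetical mapping from upstream HTTP status to canonical gRPC code names.
HTTP_TO_GRPC = {
    400: 'INVALID_ARGUMENT',    # bad prompt or malformed request
    429: 'RESOURCE_EXHAUSTED',  # rate limited
    503: 'UNAVAILABLE',         # upstream outage
}

# Codes a client may reasonably retry with backoff.
RETRYABLE = {'RESOURCE_EXHAUSTED', 'UNAVAILABLE'}


def to_grpc_code(http_status):
    """Translate an upstream HTTP status into a canonical code name."""
    return HTTP_TO_GRPC.get(http_status, 'UNKNOWN')


def backoff_delays(attempts, base=0.5, cap=8.0, rng=random.random):
    """Exponential backoff with full jitter.

    Delay i is rng() * min(cap, base * 2**i); with the default rng that is
    uniform in [0, min(cap, base * 2**i)).
    """
    return [rng() * min(cap, base * 2 ** i) for i in range(attempts)]


code = to_grpc_code(429)
if code in RETRYABLE:
    delays = backoff_delays(5)  # seconds to sleep before each retry
```

Jitter matters here because many callers hitting the same rate limit would otherwise retry in lockstep and trip it again.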
*Last updated: June 2026.*