GraphQL AI Resolvers: Complete Integration Guide
Putting an LLM call inside a GraphQL resolver is easy. Doing it without wrecking your API's latency profile is the actual problem: a typical GraphQL query resolves in tens of milliseconds, while an LLM call takes seconds. This guide covers three patterns for handling that mismatch, with working code: synchronous resolvers (rarely right), subscriptions for token streaming, and mutation-plus-polling for long jobs.
The naive version (and when it's fine)
```typescript
import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

// Apollo Server resolver calling an LLM directly
const resolvers = {
  Query: {
    summarizeDocument: async (_, { docId }, { dataSources }) => {
      const doc = await dataSources.docs.get(docId);
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: `Summarize:\n${doc.text}` }]
      });
      return completion.choices[0].message.content;
    }
  }
};
```
This blocks the whole query for seconds. It's acceptable only when the field is the *sole* thing requested and the client shows a spinner anyway. The trap: GraphQL lets clients combine fields freely, so someone will eventually put summarizeDocument in the same query as ten fast fields — and now your fast fields take four seconds too. If you keep synchronous AI fields, document them and consider a separate query type so they can't be combined casually.
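One way to keep slow AI fields from being combined casually is to reject operations that mix them with other top-level fields. The sketch below is a hypothetical guard, not part of Apollo Server: `SLOW_AI_FIELDS` and `checkSlowFieldIsolation` are illustrative names, and it assumes you have already extracted the top-level field names from the parsed operation.

```typescript
// Hypothetical registry of fields known to take seconds to resolve.
const SLOW_AI_FIELDS = new Set<string>(['summarizeDocument']);

// Returns an error message if a slow AI field is mixed with other
// top-level fields, or null if the operation is acceptable.
function checkSlowFieldIsolation(topLevelFields: string[]): string | null {
  const slow = topLevelFields.filter((field) => SLOW_AI_FIELDS.has(field));
  if (slow.length === 0) return null;            // no AI fields requested
  if (topLevelFields.length === 1) return null;  // AI field requested alone
  return `Field "${slow[0]}" is slow and must be requested in its own query`;
}
```

In a real server the field names would come from the operation's selection set, for example inside a validation step that runs before execution.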
Pattern 1: Subscriptions for token streaming
GraphQL subscriptions (over WebSocket via graphql-ws) are the native way to stream tokens:
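```typescript
import { randomUUID } from 'node:crypto';
import OpenAI from 'openai';
import { PubSub } from 'graphql-subscriptions';

const openai = new OpenAI();
const pubsub = new PubSub();

const resolvers = {
  Mutation: {
    startChat: async (_, { prompt }) => {
      const sessionId = randomUUID();
      const topic = `CHAT_${sessionId}`;
      // Fire and don't await: stream tokens via pubsub
      (async () => {
        try {
          const stream = await openai.chat.completions.create({
            model: 'gpt-4o-mini',
            messages: [{ role: 'user', content: prompt }],
            stream: true
          });
          for await (const chunk of stream) {
            const token = chunk.choices[0]?.delta?.content ?? '';
            if (token) pubsub.publish(topic, { chatTokens: { token, done: false } });
          }
        } catch (err) {
          // Without this, a failed LLM call becomes an unhandled rejection
          console.error('chat stream failed', err);
        } finally {
          // Always signal completion so subscribers don't wait forever
          pubsub.publish(topic, { chatTokens: { token: '', done: true } });
        }
      })();
      return { sessionId };
    }
  },
  Subscription: {
    chatTokens: {
      // Newer graphql-subscriptions releases rename this to asyncIterableIterator; check your version
      subscribe: (_, { sessionId }) => pubsub.asyncIterator(`CHAT_${sessionId}`)
    }
  }
};
```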
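The client calls startChat, gets a session ID, subscribes to chatTokens(sessionId), and renders tokens as they arrive. The in-memory PubSub does not buffer, so tokens published before the subscription is active are dropped; have the client subscribe immediately, or buffer per session on the server. In production, replace the in-memory PubSub with Redis pub/sub so it works across server instances.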
Worth asking first, though: does this endpoint need to be GraphQL? If the AI chat is a standalone feature, a plain SSE endpoint is simpler and has better infra support — see streaming AI responses with SSE. Subscriptions earn their complexity when the streamed data must interleave with your existing graph (auth context, entity references, federation).
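On the client side of the subscription pattern, rendering tokens as they arrive comes down to folding each chatTokens event into the displayed text. A minimal sketch, assuming events carry the `{ token, done }` payload published by the resolver; `ChatState` and `applyChatEvent` are illustrative names, not part of any client library.

```typescript
// Shape of each chatTokens event, matching the payload the resolver publishes.
interface ChatTokenEvent {
  token: string;
  done: boolean;
}

// What the UI renders: the text so far and whether the stream has finished.
interface ChatState {
  text: string;
  done: boolean;
}

const initialChatState: ChatState = { text: '', done: false };

// Pure reducer: append each token, and ignore anything that arrives after done.
function applyChatEvent(state: ChatState, event: ChatTokenEvent): ChatState {
  if (state.done) return state;
  return { text: state.text + event.token, done: event.done };
}

// Example: three events as a subscription callback might receive them.
const events: ChatTokenEvent[] = [
  { token: 'Hel', done: false },
  { token: 'lo', done: false },
  { token: '', done: true },
];
const finalState = events.reduce(applyChatEvent, initialChatState);
```

Keeping the reducer pure makes it straightforward to plug into whatever state container the client already uses.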
Pattern 2: Mutation + status polling for long jobs
For multi-second jobs where token-by-token display adds nothing (bulk summarization, embedding generation, report drafting):
```graphql
type Mutation {
  requestAnalysis(input: AnalysisInput!): AnalysisJob!
}

type Query {
  job(id: ID!): AnalysisJob
}

type AnalysisJob {
  id: ID!
  status: JobStatus!  # PENDING | RUNNING | COMPLETE | FAILED
  result: Analysis    # null until COMPLETE
}
```
The mutation enqueues a job (BullMQ, Celery, pg-boss), a worker makes the LLM call, and the client polls job(id) or subscribes to a completion event. Routing the work through a queue also gives you a natural place for retries, rate limiting, and cost accounting.
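The job lifecycle behind that schema can be sketched as a small state machine. The `JobStore` class below is a hypothetical in-memory stand-in for a real queue plus job table; it assumes the result is a plain string and exists only to show which side (mutation, worker, polling query) drives which transition.

```typescript
type JobStatus = 'PENDING' | 'RUNNING' | 'COMPLETE' | 'FAILED';

interface AnalysisJob {
  id: string;
  status: JobStatus;
  result: string | null; // null until COMPLETE
}

// In-memory stand-in for a queue and job table (illustrative only).
class JobStore {
  private jobs = new Map<string, AnalysisJob>();
  private nextId = 1;

  // Called by the requestAnalysis mutation: record the job and return at once.
  enqueue(): AnalysisJob {
    const job: AnalysisJob = { id: String(this.nextId++), status: 'PENDING', result: null };
    this.jobs.set(job.id, job);
    return { ...job };
  }

  // Called by the job(id) query that the client polls.
  get(id: string): AnalysisJob | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  // Worker-side transitions; an out-of-order transition throws.
  markRunning(id: string): void { this.transition(id, 'PENDING', 'RUNNING', null); }
  complete(id: string, result: string): void { this.transition(id, 'RUNNING', 'COMPLETE', result); }
  fail(id: string): void { this.transition(id, 'RUNNING', 'FAILED', null); }

  private transition(id: string, from: JobStatus, to: JobStatus, result: string | null): void {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`Unknown job ${id}`);
    if (job.status !== from) throw new Error(`Job ${id} is ${job.status}, expected ${from}`);
    job.status = to;
    job.result = result;
  }
}

// Example lifecycle: the mutation enqueues, the worker runs and completes.
const store = new JobStore();
const job = store.enqueue();
store.markRunning(job.id);
store.complete(job.id, 'summary text');
```

A worker would call markRunning before the LLM call and complete or fail afterwards, while the resolver for job(id) only ever reads.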
The N+1 problem, AI edition
GraphQL's classic N+1 becomes an N×cost problem with LLM fields. A list query with an AI field — products { aiDescription } — fires one LLM call *per item*. Defenses, in order of preference:
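To make the N×cost multiplication concrete, here is a hypothetical sketch of one common mitigation: resolve the AI field for the whole list in one batched call and cache results by ID. `DescriptionCache` and `BatchGenerate` are illustrative names, and the generator is synchronous only to keep the sketch short; a real one would return a Promise and wrap an LLM call.

```typescript
// Hypothetical batch generator: one call returns descriptions for many IDs.
type BatchGenerate = (ids: string[]) => Map<string, string>;

class DescriptionCache {
  private cache = new Map<string, string>();

  constructor(private generate: BatchGenerate) {}

  // Resolve descriptions for a whole list with at most one generate() call.
  loadMany(ids: string[]): string[] {
    // Deduplicate, then keep only the IDs that are not cached yet.
    const missing = Array.from(new Set(ids)).filter((id) => !this.cache.has(id));
    if (missing.length > 0) {
      this.generate(missing).forEach((text, id) => this.cache.set(id, text));
    }
    // IDs the generator did not return fall back to an empty string.
    return ids.map((id) => this.cache.get(id) ?? '');
  }
}
```

Called once from the list resolver rather than from each item's field resolver, this turns N model calls into at most one per request, and zero for items already cached.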
Cost and abuse controls
FAQ
Should the LLM call live in the resolver or a separate service? Thin resolvers, fat service. Resolvers handle graph mechanics; an AIService class owns prompts, retries, model fallbacks (fallback chains pattern) and caching. Testable, and reusable from REST/jobs too.
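A minimal sketch of the "thin resolvers, fat service" split, assuming each model is wrapped as an async function from prompt to text. The `ModelCall` signature, the retry count, and the prompt-keyed cache are all illustrative choices, not a prescribed design.

```typescript
// A model caller is any async function from prompt to text (hypothetical signature).
type ModelCall = (prompt: string) => Promise<string>;

class AIService {
  private cache = new Map<string, string>();

  // Models are tried in order: primary first, fallbacks after.
  constructor(private models: ModelCall[], private retriesPerModel = 1) {}

  async complete(prompt: string): Promise<string> {
    const cached = this.cache.get(prompt);
    if (cached !== undefined) return cached;

    let lastError: unknown = new Error('No models configured');
    for (const call of this.models) {
      for (let attempt = 0; attempt <= this.retriesPerModel; attempt++) {
        try {
          const text = await call(prompt);
          this.cache.set(prompt, text);
          return text;
        } catch (err) {
          lastError = err; // retry, then fall through to the next model
        }
      }
    }
    throw lastError;
  }
}

// The resolver stays thin: it builds the prompt and delegates to the service.
const summarize = (service: AIService, text: string): Promise<string> =>
  service.complete(`Summarize:\n${text}`);
```

Because the service takes its model callers as constructor arguments, tests can pass stubs, and REST handlers or background jobs can reuse the same instance.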
@defer instead of subscriptions? @defer (incremental delivery) fits "fast fields now, one slow AI field later" — a single response with the AI part arriving late. It's per-fragment, not per-token, and client support is still uneven; test your stack before committing.
Which model for resolver workloads? Latency-sensitive graph fields want small/fast models; quality-critical generation wants frontier ones — compare current options in the model library.
*Last updated: June 2026.*