GraphQL AI Resolvers: Complete Integration Guide

Putting an LLM call inside a GraphQL resolver is easy. Doing it without wrecking your API's latency profile is the actual problem: a typical GraphQL query resolves in tens of milliseconds, an LLM call takes seconds. This guide covers the three patterns that handle that mismatch — synchronous resolvers (rarely right), subscriptions for token streaming, and mutation-plus-polling for long jobs — with working code.

The naive version (and when it's fine)

typescript
// Apollo Server resolver calling an LLM directly
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const resolvers = {
  Query: {
    summarizeDocument: async (_, { docId }, { dataSources }) => {
      // dataSources.docs is the application's own document store
      const doc = await dataSources.docs.get(docId);
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: `Summarize:\n${doc.text}` }]
      });
      return completion.choices[0].message.content;
    }
  }
};

This blocks the whole query for seconds. It's acceptable only when the field is the *sole* thing requested and the client shows a spinner anyway. The trap: GraphQL lets clients combine fields freely, so someone will eventually put summarizeDocument in the same query as ten fast fields — and now your fast fields take four seconds too. If you keep synchronous AI fields, document them and consider a separate query type so they can't be combined casually.
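One way to enforce the "sole field" rule, rather than only documenting it, is to reject operations that mix a synchronous AI field with other root fields. The sketch below is framework-free and makes assumptions: `SYNC_AI_FIELDS` and `assertAiFieldIsAlone` are hypothetical names, and the list of root field names is assumed to be derived by your server from the parsed operation's selection set. It only looks at root-level fields, so nested AI fields need a separate check.

```typescript
// Hypothetical registry of root fields that block on an LLM call.
const SYNC_AI_FIELDS = new Set<string>(['summarizeDocument']);

/**
 * Throws if a synchronous AI field shares an operation with other root fields.
 * `rootFields` is assumed to be the list of top-level field names in the
 * incoming operation.
 */
function assertAiFieldIsAlone(rootFields: string[]): void {
  const aiFields = rootFields.filter((name) => SYNC_AI_FIELDS.has(name));
  if (aiFields.length > 0 && rootFields.length > 1) {
    throw new Error(
      `${aiFields.join(', ')} blocks on an LLM call and must be requested on its own`
    );
  }
}
```

Called before execution, this turns the four-second surprise into an immediate, explainable error for the client that combined the fields.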

Pattern 1: Subscriptions for token streaming

GraphQL subscriptions (over WebSocket via graphql-ws) are the native way to stream tokens:

typescript
import { randomUUID } from 'node:crypto';
import OpenAI from 'openai';
import { PubSub } from 'graphql-subscriptions';

const openai = new OpenAI();
const pubsub = new PubSub();

const resolvers = {
  Mutation: {
    startChat: async (_, { prompt }) => {
      const sessionId = randomUUID();
      // Fire and don't await: tokens reach subscribers via pubsub
      (async () => {
        const stream = await openai.chat.completions.create({
          model: 'gpt-4o-mini',
          messages: [{ role: 'user', content: prompt }],
          stream: true
        });
        for await (const chunk of stream) {
          const token = chunk.choices[0]?.delta?.content ?? '';
          if (token) pubsub.publish(`CHAT_${sessionId}`, { chatTokens: { token, done: false } });
        }
      })()
        .catch((err) => console.error('chat stream failed', err))
        // Always send the terminal event so subscribers stop waiting, even after a failure
        .finally(() => pubsub.publish(`CHAT_${sessionId}`, { chatTokens: { token: '', done: true } }));
      return { sessionId };
    }
  },
  Subscription: {
    chatTokens: {
      // Newer graphql-subscriptions releases (v3) rename this to asyncIterableIterator; check your installed version
      subscribe: (_, { sessionId }) => pubsub.asyncIterator(`CHAT_${sessionId}`)
    }
  }
};

The client calls startChat, gets a session ID, subscribes to chatTokens(sessionId), and renders tokens as they arrive. In production replace the in-memory PubSub with Redis pub/sub so it works across server instances.
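On the client, rendering comes down to folding each `{ token, done }` event into the displayed text. The following is a minimal sketch of that step as a pure reducer. `ChatState`, `initialChatState`, and `applyChatToken` are hypothetical names, and the event shape is assumed to match what the `chatTokens` subscription publishes.

```typescript
// Shape of each event published on the chatTokens subscription.
interface ChatTokenEvent {
  token: string;
  done: boolean;
}

// Hypothetical client-side view state for one chat session.
interface ChatState {
  text: string;
  finished: boolean;
}

const initialChatState: ChatState = { text: '', finished: false };

/** Pure reducer: fold one subscription event into the rendered state. */
function applyChatToken(state: ChatState, event: ChatTokenEvent): ChatState {
  // Ignore anything that arrives after the terminal event.
  if (state.finished) return state;
  return {
    text: state.text + event.token,
    finished: event.done,
  };
}

// Usage: feed events in arrival order, for example from a subscription callback.
const sampleEvents: ChatTokenEvent[] = [
  { token: 'Hel', done: false },
  { token: 'lo', done: false },
  { token: '', done: true },
];
const finalChatState = sampleEvents.reduce(applyChatToken, initialChatState);
```

Keeping this step pure makes it easy to unit-test and to plug into whatever state container the UI already uses.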

Worth asking first, though: does this endpoint need to be GraphQL? If the AI chat is a standalone feature, a plain SSE endpoint is simpler and has better infra support — see streaming AI responses with SSE. Subscriptions earn their complexity when the streamed data must interleave with your existing graph (auth context, entity references, federation).

Pattern 2: Mutation + status polling for long jobs

For multi-second jobs where token-by-token display adds nothing (bulk summarization, embedding generation, report drafting):

graphql
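type Mutation {
  requestAnalysis(input: AnalysisInput!): AnalysisJob!
}
type Query {
  job(id: ID!): AnalysisJob   # polled by the client until COMPLETE or FAILED
}
enum JobStatus { PENDING RUNNING COMPLETE FAILED }
type AnalysisJob {
  id: ID!
  status: JobStatus!
  result: Analysis     # null until COMPLETE
}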

The mutation enqueues a job (BullMQ, Celery, pg-boss), a worker makes the LLM call, and the client either polls job(id) or subscribes to a completion event. The queue layer is also the natural place for retries, rate limiting, and cost accounting, since every LLM call passes through it.
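The job lifecycle can be sketched independently of any particular queue. The in-memory store below is an illustration only: `InMemoryJobStore` and its method names are hypothetical, the `error` field is an assumption that the schema above does not include, and a real deployment would persist jobs in the queue's own backend instead of a `Map`.

```typescript
type JobStatus = 'PENDING' | 'RUNNING' | 'COMPLETE' | 'FAILED';

interface AnalysisJob<R> {
  id: string;
  status: JobStatus;
  result: R | null; // null until COMPLETE
  error: string | null; // assumption: not part of the schema shown above
}

// Allowed status transitions; anything else indicates a worker bug.
const NEXT_STATUS: Record<JobStatus, JobStatus[]> = {
  PENDING: ['RUNNING'],
  RUNNING: ['COMPLETE', 'FAILED'],
  COMPLETE: [],
  FAILED: [],
};

class InMemoryJobStore<R> {
  private jobs = new Map<string, AnalysisJob<R>>();
  private nextId = 1;

  /** Called by the requestAnalysis mutation: record the job and return at once. */
  enqueue(): AnalysisJob<R> {
    const job: AnalysisJob<R> = {
      id: String(this.nextId++),
      status: 'PENDING',
      result: null,
      error: null,
    };
    this.jobs.set(job.id, job);
    return { ...job };
  }

  /** Called by the job(id) query that clients poll. */
  get(id: string): AnalysisJob<R> | null {
    const job = this.jobs.get(id);
    return job ? { ...job } : null;
  }

  /** Called by the worker as the LLM call progresses. */
  transition(id: string, status: JobStatus, patch: { result?: R; error?: string } = {}): void {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`unknown job ${id}`);
    if (!NEXT_STATUS[job.status].includes(status)) {
      throw new Error(`illegal transition ${job.status} -> ${status}`);
    }
    job.status = status;
    if (status === 'COMPLETE') job.result = patch.result ?? null;
    if (status === 'FAILED') job.error = patch.error ?? 'unknown error';
  }
}
```

Rejecting illegal transitions keeps a retried or duplicated worker run from overwriting a finished job.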

The N+1 problem, AI edition

GraphQL's classic N+1 becomes an N×cost problem with LLM fields. A list query with an AI field — products { aiDescription } — fires one LLM call *per item*. Defenses, in order of preference:

  • Cache aggressively. AI fields over stable inputs are perfect cache candidates: key on a hash of (model + prompt template version + input), store in Redis with a long TTL.
  • Batch via DataLoader — and actually batch at the LLM level: one call that processes 20 items (ask for a JSON array out, validate it — see Zod vs Pydantic for AI validation) instead of 20 calls.
  • Bound list sizes on any type with AI fields, and set query-complexity costs so an attacker can't make you spend $50 of inference with one query.
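The first two defenses can be sketched with two small helpers. Both names (`aiCacheKey`, `parseBatchReply`) are hypothetical, the `ai:` key prefix is an arbitrary choice, and the batched reply is assumed to be a JSON array of strings in the same order as the inputs. A schema validator would replace the hand-written checks in a real service.

```typescript
import { createHash } from 'node:crypto';

/** Cache key over everything that can change the output: model, template version, input. */
function aiCacheKey(model: string, templateVersion: string, input: string): string {
  const hash = createHash('sha256')
    // JSON-encode the tuple so field boundaries stay unambiguous ("a"+"bc" vs "ab"+"c").
    .update(JSON.stringify([model, templateVersion, input]))
    .digest('hex');
  return `ai:${hash}`;
}

/**
 * Validate the raw text of a batched LLM reply that was asked to return a JSON
 * array of strings, one per input item and in the same order.
 */
function parseBatchReply(raw: string, expectedCount: number): string[] {
  const parsed: unknown = JSON.parse(raw); // throws on malformed JSON
  if (!Array.isArray(parsed) || parsed.length !== expectedCount) {
    throw new Error(`expected a JSON array of ${expectedCount} items`);
  }
  if (!parsed.every((item): item is string => typeof item === 'string')) {
    throw new Error('every item must be a string');
  }
  return parsed;
}
```

Because the template version is part of the key, changing the prompt invalidates old entries without a manual cache flush. A reply that fails validation should be treated as a failed batch and retried or split, not partially trusted.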
Cost and abuse controls

  • Assign AI fields a high complexity score in your query-cost plugin (graphql-cost-analysis or Apollo's built-ins) and set per-client budgets.
  • Per-user rate limits on AI mutations specifically, not just global API limits.
  • Log token usage per resolver with the request context attached — when the bill spikes, you want to know which field and which client.
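A per-client budget and per-field usage log can start as something very small. The sketch below is an in-memory illustration under stated assumptions: `TokenBudget` and `UsageRecord` are hypothetical names, the limit is a plain token count with no reset window, and a multi-instance deployment would keep the counters in a shared store.

```typescript
interface UsageRecord {
  clientId: string;
  field: string; // for example "Product.aiDescription"
  tokens: number;
}

class TokenBudget {
  private readonly limitPerClient: number;
  private spent = new Map<string, number>();
  readonly log: UsageRecord[] = [];

  constructor(limitPerClient: number) {
    this.limitPerClient = limitPerClient;
  }

  /** True if the client still has budget left; call before the LLM request. */
  canSpend(clientId: string): boolean {
    return (this.spent.get(clientId) ?? 0) < this.limitPerClient;
  }

  /** Record actual usage after the call, with enough context to trace a bill spike. */
  record(usage: UsageRecord): void {
    const previous = this.spent.get(usage.clientId) ?? 0;
    this.spent.set(usage.clientId, previous + usage.tokens);
    this.log.push(usage);
  }

  /** Total tokens per field, to answer "which field is expensive?". */
  totalsByField(): Map<string, number> {
    const totals = new Map<string, number>();
    for (const { field, tokens } of this.log) {
      totals.set(field, (totals.get(field) ?? 0) + tokens);
    }
    return totals;
  }
}
```

A resolver would check `canSpend` before calling the model and call `record` afterwards with the token count the provider reports.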
FAQ

    Should the LLM call live in the resolver or a separate service? Thin resolvers, fat service. Resolvers handle graph mechanics; an AIService class owns prompts, retries, model fallbacks (fallback chains pattern) and caching. Testable, and reusable from REST/jobs too.

    @defer instead of subscriptions? @defer (incremental delivery) fits "fast fields now, one slow AI field later" — a single response with the AI part arriving late. It's per-fragment, not per-token, and client support is still uneven; test your stack before committing.

    Which model for resolver workloads? Latency-sensitive graph fields want small/fast models; quality-critical generation wants frontier ones — compare current options in the model library.
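As a sketch of the thin-resolver, fat-service split from the first answer above: the class below owns the prompt, the fallback chain, and a cache, while the resolver only maps arguments onto it. Everything here is illustrative. `Complete` is a hypothetical function type standing in for your LLM SDK call, `summarizeText` is a made-up field, and the cache is keyed on the raw input for brevity.

```typescript
// Hypothetical "call a model, get text back" function; a real service would wrap an LLM SDK.
type Complete = (model: string, prompt: string) => Promise<string>;

class AIService {
  private readonly complete: Complete;
  private readonly models: string[];
  private readonly cache = new Map<string, string>();

  constructor(complete: Complete, models: string[]) {
    this.complete = complete;
    this.models = models;
  }

  /** Try each model in order and return the first successful result. */
  async summarize(text: string): Promise<string> {
    const cached = this.cache.get(text);
    if (cached !== undefined) return cached;

    let lastError: unknown = new Error('no models configured');
    for (const model of this.models) {
      try {
        const summary = await this.complete(model, `Summarize:\n${text}`);
        this.cache.set(text, summary);
        return summary;
      } catch (err) {
        lastError = err; // fall through to the next model in the chain
      }
    }
    throw lastError;
  }
}

// The resolver stays thin: it only maps graph arguments onto the service.
const makeResolvers = (ai: AIService) => ({
  Query: {
    summarizeText: (_: unknown, args: { text: string }) => ai.summarize(args.text),
  },
});
```

Because the completion function is injected, tests can pass a fake, and the same service instance can back REST handlers or queue workers.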


    *Last updated: June 2026.*
