GraphQL AI Resolvers: Complete Integration Guide

Putting an LLM call inside a GraphQL resolver is easy. Doing it without wrecking your API's latency profile is the actual problem: a typical GraphQL query resolves in tens of milliseconds, while an LLM call takes seconds. This guide covers three ways to handle that mismatch, with working code: synchronous resolvers (rarely right), subscriptions for token streaming, and mutation-plus-polling for long jobs.

The naive version (and when it's fine)

```typescript
// Apollo Server resolver calling an LLM directly
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const resolvers = {
  Query: {
    summarizeDocument: async (_, { docId }, { dataSources }) => {
      const doc = await dataSources.docs.get(docId);
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: `Summarize:\n${doc.text}` }]
      });
      return completion.choices[0].message.content;
    }
  }
};
```

This blocks the whole query for seconds. It's acceptable only when the field is the *sole* thing requested and the client shows a spinner anyway. The trap: GraphQL lets clients combine fields freely, so someone will eventually put summarizeDocument in the same query as ten fast fields — and now your fast fields take four seconds too. If you keep synchronous AI fields, document them and consider a separate query type so they can't be combined casually.

Pattern 1: Subscriptions for token streaming

GraphQL subscriptions (over WebSocket via graphql-ws) are the native way to stream tokens:

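```typescript
import { randomUUID } from 'node:crypto';
import { PubSub } from 'graphql-subscriptions';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pubsub = new PubSub();

const resolvers = {
  Mutation: {
    startChat: async (_, { prompt }) => {
      const sessionId = randomUUID();
      const topic = `CHAT_${sessionId}`;
      // Fire and don't await: tokens reach the client via pubsub
      (async () => {
        const stream = await openai.chat.completions.create({
          model: 'gpt-4o-mini',
          messages: [{ role: 'user', content: prompt }],
          stream: true
        });
        for await (const chunk of stream) {
          const token = chunk.choices[0]?.delta?.content ?? '';
          if (token) pubsub.publish(topic, { chatTokens: { token, done: false } });
        }
      })()
        // Without a catch, a failed LLM call becomes an unhandled rejection
        .catch((err) => console.error('chat stream failed', err))
        // Always send the terminal event so clients stop waiting
        .finally(() => pubsub.publish(topic, { chatTokens: { token: '', done: true } }));
      return { sessionId };
    }
  },
  Subscription: {
    chatTokens: {
      // graphql-subscriptions v2 API; v3 renames this to asyncIterableIterator
      subscribe: (_, { sessionId }) => pubsub.asyncIterator(`CHAT_${sessionId}`)
    }
  }
};
```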

The client calls startChat, gets a session ID, subscribes to chatTokens(sessionId), and renders tokens as they arrive. In production replace the in-memory PubSub with Redis pub/sub so it works across server instances.
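One caveat with this flow: an in-memory PubSub only delivers to listeners that are already subscribed when an event is published, so tokens emitted between startChat returning and the client's subscription starting can be lost. A minimal sketch of one way around that is a per-session buffer that replays earlier tokens to late subscribers. `TokenChannel`, `ChatToken`, and `channelFor` are hypothetical names for illustration, not part of graphql-subscriptions.

```typescript
// Hypothetical helper, not part of graphql-subscriptions: a per-session
// channel that buffers every token so a late subscriber still sees them all.
type ChatToken = { token: string; done: boolean };

class TokenChannel {
  private buffer: ChatToken[] = [];
  private waiters: Array<() => void> = [];

  publish(event: ChatToken): void {
    this.buffer.push(event);
    // Wake every subscriber currently waiting for new data.
    const waiting = this.waiters;
    this.waiters = [];
    for (const wake of waiting) wake();
  }

  // Each subscriber replays from the start of the buffer, then follows live.
  async *subscribe(): AsyncGenerator<ChatToken> {
    let index = 0;
    while (true) {
      while (index < this.buffer.length) {
        const event = this.buffer[index++];
        yield event;
        if (event.done) return; // terminal event ends the stream
      }
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
  }
}

const channels = new Map<string, TokenChannel>();

// Returns the channel for a session, creating it on first use.
function channelFor(sessionId: string): TokenChannel {
  let channel = channels.get(sessionId);
  if (!channel) {
    channel = new TokenChannel();
    channels.set(sessionId, channel);
  }
  return channel;
}
```

In this sketch the mutation would publish to `channelFor(sessionId)` and the subscription would iterate `channelFor(sessionId).subscribe()`. A real version would also delete the channel once the stream is done so buffers don't accumulate.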

Worth asking first, though: does this endpoint need to be GraphQL? If the AI chat is a standalone feature, a plain SSE endpoint is simpler and has better infra support — see streaming AI responses with SSE. Subscriptions earn their complexity when the streamed data must interleave with your existing graph (auth context, entity references, federation).

Pattern 2: Mutation + status polling for long jobs

For multi-second jobs where token-by-token display adds nothing (bulk summarization, embedding generation, report drafting):

```graphql
type Mutation {
  requestAnalysis(input: AnalysisInput!): AnalysisJob!
}

type AnalysisJob {
  id: ID!
  status: JobStatus!   # PENDING | RUNNING | COMPLETE | FAILED
  result: Analysis     # null until COMPLETE
}
```

The mutation enqueues the job (BullMQ, Celery, pg-boss), a worker makes the LLM call, and the client polls job(id) or subscribes to a completion event. The queue layer is also a natural home for retries, rate limiting, and cost accounting.
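The job lifecycle can be sketched without a real queue. This is a minimal in-memory illustration, assuming a single process. The `jobs` map stands in for the queue and job store, and `requestAnalysis`, `runJob`, and `getJob` are hypothetical names that mirror the mutation, the worker, and the job(id) query.

```typescript
// In-memory stand-in for a real queue (BullMQ, pg-boss, ...); all names here
// are hypothetical and exist only to show the job lifecycle.
type JobStatus = 'PENDING' | 'RUNNING' | 'COMPLETE' | 'FAILED';

interface AnalysisJob {
  id: string;
  status: JobStatus;
  result: string | null; // null until COMPLETE
}

const jobs = new Map<string, AnalysisJob>();
let nextId = 1;

// What the requestAnalysis mutation would do: record the job and return at once.
function requestAnalysis(): AnalysisJob {
  const job: AnalysisJob = { id: String(nextId++), status: 'PENDING', result: null };
  jobs.set(job.id, job);
  return job;
}

// What a worker would do: run the slow LLM call outside the request path.
async function runJob(id: string, analyze: () => Promise<string>): Promise<void> {
  const job = jobs.get(id);
  if (!job) return;
  job.status = 'RUNNING';
  try {
    job.result = await analyze();
    job.status = 'COMPLETE';
  } catch {
    job.status = 'FAILED';
  }
}

// What the job(id) query would do: a cheap lookup the client can poll.
function getJob(id: string): AnalysisJob | null {
  return jobs.get(id) ?? null;
}
```

The point of the shape is that the only slow step, `analyze`, never runs inside a resolver. Both the mutation and the polling query return in the usual tens of milliseconds.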

The N+1 problem, AI edition

GraphQL's classic N+1 becomes an N×cost problem with LLM fields. A list query with an AI field — products { aiDescription } — fires one LLM call *per item*. Defenses, in order of preference:

1. Cache aggressively. AI fields over stable inputs are perfect cache candidates: key on a hash of (model + prompt template version + input), store in Redis with a long TTL.

2. Batch via DataLoader, and actually batch at the LLM level: one call that processes 20 items (ask for a JSON array out, validate it; see Zod vs Pydantic for AI validation) instead of 20 calls.

3. Bound list sizes on any type with AI fields, and set query-complexity costs so an attacker can't make you spend $50 of inference with one query.
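The first two defenses combine naturally. The sketch below assumes a hypothetical `callModel` function that sends many inputs in one LLM request and returns one output per input, in order. The `Map` stands in for Redis, and `MODEL`, `PROMPT_VERSION`, `cacheKey`, and `describeMany` are illustrative names. The function has the shape of a DataLoader batch function (many keys in, same-order results out), though DataLoader itself is left out to keep the example self-contained.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical names throughout; `callModel` stands in for one LLM request
// that takes many inputs and returns one output per input, in order.
type BatchModel = (inputs: string[]) => Promise<string[]>;

const MODEL = 'gpt-4o-mini';
const PROMPT_VERSION = 'v1'; // bump when the prompt template changes

const cache = new Map<string, string>(); // stand-in for Redis

// Key on a hash of (model + prompt template version + input).
function cacheKey(input: string): string {
  return createHash('sha256')
    .update(JSON.stringify([MODEL, PROMPT_VERSION, input]))
    .digest('hex');
}

// Many inputs in, same-order results out; only cache misses reach the model.
async function describeMany(inputs: string[], callModel: BatchModel): Promise<string[]> {
  const misses = Array.from(new Set(inputs.filter((input) => !cache.has(cacheKey(input)))));
  if (misses.length > 0) {
    const outputs = await callModel(misses); // one LLM call for all misses
    // Validate the shape before trusting it: a model can drop or merge items.
    if (!Array.isArray(outputs) || outputs.length !== misses.length) {
      throw new Error(`expected ${misses.length} outputs from the model`);
    }
    misses.forEach((input, i) => cache.set(cacheKey(input), outputs[i]));
  }
  return inputs.map((input) => cache.get(cacheKey(input)) as string);
}
```

Because the prompt version is part of the key, changing the template invalidates old entries without a manual flush.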

Cost and abuse controls

Assign AI fields a high complexity score in your query-cost plugin (graphql-cost-analysis or Apollo's built-ins) and set per-client budgets.

Per-user rate limits on AI mutations specifically, not just global API limits.

Log token usage per resolver with the request context attached — when the bill spikes, you want to know which field and which client.
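A minimal sketch of the last two controls, assuming token counts come from the provider's response (for example the `usage` field on an OpenAI chat completion). The budget value, the in-memory maps, and the names `assertWithinBudget`, `recordUsage`, and `tokensByField` are all hypothetical. Resetting the budget per time window is left out.

```typescript
// Hypothetical sketch: a per-client token budget plus one usage record per
// AI field call. In production these maps would live in Redis or similar.
interface UsageRecord {
  clientId: string;
  field: string; // e.g. "Query.summarizeDocument"
  tokens: number;
}

const TOKEN_BUDGET = 10_000; // assumed limit per client; tune per plan
const spent = new Map<string, number>();
const usageLog: UsageRecord[] = [];

// Call before the LLM request; throws once the client has used up its budget.
function assertWithinBudget(clientId: string): void {
  if ((spent.get(clientId) ?? 0) >= TOKEN_BUDGET) {
    throw new Error(`AI token budget exhausted for client ${clientId}`);
  }
}

// Call after the LLM request with the token count the provider reported.
function recordUsage(record: UsageRecord): void {
  spent.set(record.clientId, (spent.get(record.clientId) ?? 0) + record.tokens);
  usageLog.push(record);
}

// Answers "which field?" when the bill spikes; grouping by client is analogous.
function tokensByField(): Map<string, number> {
  const totals = new Map<string, number>();
  for (const { field, tokens } of usageLog) {
    totals.set(field, (totals.get(field) ?? 0) + tokens);
  }
  return totals;
}
```

A resolver would call `assertWithinBudget` with the client ID from the request context before the LLM call and `recordUsage` after it.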

FAQ

Should the LLM call live in the resolver or a separate service? Thin resolvers, fat service. Resolvers handle graph mechanics; an AIService class owns prompts, retries, model fallbacks (fallback chains pattern) and caching. Testable, and reusable from REST/jobs too.
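A sketch of that split, under stated assumptions: `ModelCall` is a hypothetical stand-in for the real SDK call, injected so the service can be tested without a network. The model list, the attempt count, and the `summarizeText` field are illustrative. Backoff, logging, and caching are omitted.

```typescript
// Hypothetical AIService: the resolver stays a one-liner while prompts,
// retries, and model fallback live here.
type ModelCall = (model: string, prompt: string) => Promise<string>;

class AIService {
  constructor(
    private readonly call: ModelCall,
    private readonly models: string[] = ['gpt-4o-mini'], // tried in order
    private readonly attemptsPerModel = 2,
  ) {}

  // The prompt template is owned by the service, not the resolver.
  async summarize(text: string): Promise<string> {
    return this.complete(`Summarize:\n${text}`);
  }

  // Try each model up to attemptsPerModel times before falling back to the next.
  private async complete(prompt: string): Promise<string> {
    let lastError: unknown;
    for (const model of this.models) {
      for (let attempt = 0; attempt < this.attemptsPerModel; attempt++) {
        try {
          return await this.call(model, prompt);
        } catch (error) {
          lastError = error; // a real service would back off and log here
        }
      }
    }
    throw lastError ?? new Error('no models configured');
  }
}

// The resolver only does graph mechanics: read args, delegate, return.
const resolvers = {
  Query: {
    summarizeText: (_: unknown, args: { text: string }, ctx: { ai: AIService }) =>
      ctx.ai.summarize(args.text),
  },
};
```

The same `AIService` instance can be handed to a REST handler or a queue worker, which is where the reuse comes from.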

@defer instead of subscriptions? @defer (incremental delivery) fits "fast fields now, one slow AI field later" — a single response with the AI part arriving late. It's per-fragment, not per-token, and client support is still uneven; test your stack before committing.

Which model for resolver workloads? Latency-sensitive graph fields want small/fast models; quality-critical generation wants frontier ones — compare current options in the model library.

*Last updated: June 2026.*
