Modal AI Infrastructure: Complete Setup Guide

Serverless GPU infrastructure for AI workloads

Modal provides serverless GPUs with a Python-native developer experience: you decorate functions, declare their container image and GPU in code, and Modal handles provisioning, scaling (including to zero), and per-second billing. For AI teams it occupies a sweet spot, heavier than calling hosted APIs but far lighter than managing GPU Kubernetes, and it has become a default answer for "I need GPUs sometimes, not a cluster always."

The mental model

Everything is code, with no YAML and no console-clicking:

python
import modal

app = modal.App('embeddings-service')

# Container image: Debian base plus the Python packages the function needs.
image = modal.Image.debian_slim().pip_install('sentence-transformers', 'torch')


@app.function(image=image, gpu='A10G', timeout=600)
def embed(texts: list[str]) -> list[list[float]]:
    # Imported inside the function so it resolves in the container, not locally.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-m3')
    return model.encode(texts).tolist()


@app.local_entrypoint()
def main():
    # Runs on an A10G in the cloud; prints how many embeddings came back.
    print(len(embed.remote(['hello world'])))

Run modal run script.py and the function executes on a cloud GPU; modal deploy turns it into a persistent, deployed app. The container image, dependencies, GPU type, and scaling policy all live next to the function they serve: infrastructure as actual code.

What it's genuinely good for

  • Bursty GPU jobs: nightly embedding backfills, batch transcription/OCR, and periodic fine-tuning runs. You pay for minutes of use and idle at zero.
  • Model inference endpoints with spiky traffic: @app.function plus a web endpoint decorator gives you an autoscaling HTTPS service, and scale-to-zero means a demo or internal tool costs nothing while unused.
  • Parallel fan-out: fn.map(items) distributes inputs across many containers. This is the "process 10K documents on 50 GPUs for 10 minutes" job that is miserable to schedule anywhere else.
  • Scheduled AI pipelines: a schedule on the function (for example modal.Cron) turns enrichment, dedup, and report jobs into deployed schedules with logs.
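
The fan-out item above is easier to reason about with the batching step written out. This is a minimal sketch in plain Python with no Modal dependency: `chunked` is a hypothetical helper (not part of Modal) that groups documents into fixed-size batches, and the batch size of 200 is an arbitrary illustration.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 10,000 placeholder documents split into batches of 200 gives 50 batches.
docs = [f"document {i}" for i in range(10_000)]
batches = list(chunked(docs, 200))
print(len(batches), len(batches[0]), len(batches[-1]))  # 50 200 200
```

Each batch would then be one input to a mapped call such as `embed.map(batches)`, using the `embed` function from the earlier example. How many containers actually run in parallel depends on the function's scaling limits, not on the number of batches.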

Production knobs that matter

  • Cold starts are the tax for scale-to-zero. Mitigations: min_containers=1 (formerly keep_warm) for latency-sensitive endpoints, which trades idle cost back for latency; memory snapshots for faster Python startup; and Volumes to cache model weights, so a cold start does not mean re-downloading 15 GB.
  • Volumes and Secrets: modal.Volume provides persistent storage for weights and datasets, and env-style secrets management is built in. These are the two primitives that make real apps possible.
  • GPU selection: T4/L4/A10G for small models and embeddings; A100/H100 tiers for serious inference and training. GPU choice drives most of the bill, so start small, profile, and move up.
  • Concurrency control: per-function max_containers and batching settings keep a queue spike from provisioning a surprise 200 GPUs.
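
The weight-caching idea above boils down to "download only if the files are not already on the mounted path". Here is a minimal sketch of that check in plain Python. The names `ensure_weights`, `fake_download`, and the `.complete` marker file are all hypothetical; in a Modal app, `cache_dir` would be a path where a modal.Volume is mounted, and `download` would wrap whatever fetches your model.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, download: Callable[[Path], None]) -> bool:
    """Populate cache_dir once; return True only if a download happened.

    The marker file is written last, so a download interrupted part-way
    is retried on the next start instead of being trusted.
    """
    marker = cache_dir / ".complete"
    if marker.exists():
        return False  # warm cache: skip the expensive download
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(cache_dir)
    marker.touch()
    return True


def fake_download(target: Path) -> None:
    # Stand-in for a real multi-gigabyte download (assumption for the demo).
    (target / "model.bin").write_bytes(b"weights")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "bge-m3"
    first = ensure_weights(cache, fake_download)   # cold: downloads
    second = ensure_weights(cache, fake_download)  # warm: skips
    print(first, second)  # True False
```

On a real Volume, writes may also need to be committed before other containers can see them, so check the current Volume documentation before relying on this pattern.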

Where Modal is the *wrong* tool

  • Steady high-volume LLM serving: at sustained utilization, reserved capacity or self-managed vLLM beats per-second serverless pricing, because the serverless premium buys elasticity you are not using. The crossover math is the same buy-vs-rent calculation as for hosted open-model APIs (which, for *standard* models, are simpler than running your own anything).
  • Just calling hosted LLM APIs: no GPUs are needed, so plain serverless functions or containers are cheaper and simpler.
  • Strict data-residency/VPC mandates: Modal is a managed cloud, so check its current enterprise and network options against your compliance requirements.

Competitors in the same lane: RunPod (cheaper raw GPU-hours, less polished DX), Replicate (model-zoo-first), and cloud-native serverless GPU offerings. Modal's moat is the Python DX; if your team feels infra-as-decorators is "how it should work," that's the fit signal.
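
The crossover math mentioned under steady high-volume serving can be sketched as a break-even utilization. All prices below are made-up placeholders, not Modal's or any other provider's rates, and the function names are hypothetical; the point is the shape of the calculation.

```python
def break_even_utilization(serverless_per_hour: float, reserved_per_hour: float) -> float:
    """Busy fraction above which an always-on reserved GPU becomes cheaper."""
    if serverless_per_hour <= 0 or reserved_per_hour <= 0:
        raise ValueError("prices must be positive")
    # Serverless bills only busy time; reserved bills every hour regardless.
    return min(1.0, reserved_per_hour / serverless_per_hour)


def monthly_costs(serverless_per_hour: float, reserved_per_hour: float,
                  utilization: float, hours: float = 730.0) -> tuple[float, float]:
    """Return (serverless, reserved) cost for one GPU over `hours`."""
    return serverless_per_hour * hours * utilization, reserved_per_hour * hours


# Hypothetical rates: $2.00/h serverless while busy, $1.20/h reserved always-on.
threshold = break_even_utilization(2.00, 1.20)
print(f"{threshold:.0%}")  # 60%

bursty = monthly_costs(2.00, 1.20, utilization=0.25)
print(bursty)  # (365.0, 876.0)
```

With these placeholder numbers, a GPU that is busy less than 60% of the time is cheaper on per-second billing. Above that, the reserved option wins, before counting the engineering time to operate it.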

FAQ

Cost predictability? Per-second billing with scale-to-zero is great for bursty loads and dangerous for runaway loops, so set max_containers and budget alerts on day one.
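
One way to make the max_containers advice concrete is to compute the worst-case bill a cap allows. This is a minimal sketch: the hourly rate is a placeholder rather than a real price, and `worst_case_spend` is a hypothetical helper.

```python
def worst_case_spend(max_containers: int, gpu_per_hour: float, hours: float) -> float:
    """Upper bound on GPU spend if every allowed container runs the whole time."""
    return max_containers * gpu_per_hour * hours


# A runaway loop left running over a 48-hour weekend at a hypothetical $1.10/h.
capped = worst_case_spend(4, 1.10, 48)
uncapped = worst_case_spend(200, 1.10, 48)
print(f"${capped:,.2f} vs ${uncapped:,.2f}")  # $211.20 vs $10,560.00
```

The cap does not prevent the bug, but it bounds the damage to a number you chose in advance.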

Can I serve an open LLM on it? Yes. Running vLLM inside a Modal function with a web endpoint is a documented pattern: keep the weights on a Volume and set min_containers (formerly keep_warm) for latency. At sustained load, revisit the crossover math above.

Local development? The same function can be called locally with .local() or remotely with .remote(), and that dev loop is genuinely the selling point.


*Last updated: June 2026. GPU lineup and pricing change; verify on modal.com.*