Modal AI Infrastructure: Complete Setup Guide

Serverless GPU infrastructure for AI workloads

By AI Skill Navigation Editorial Team | Published June 12, 2026

Modal for AI Workloads: Complete Setup Guide

Modal is serverless GPUs with a Python-native developer experience: you decorate functions, declare their container image and GPU in code, and Modal handles provisioning, scaling (including to zero), and billing by the second. For AI teams it occupies a sweet spot — heavier than calling hosted APIs, far lighter than managing GPU Kubernetes — and it's become a default answer for "I need GPUs sometimes, not a cluster always."

The mental model

Everything is code — no YAML, no console-clicking:

```python
import modal

app = modal.App('embeddings-service')

# Container image: Debian slim base plus the packages the function needs.
image = (modal.Image.debian_slim()
         .pip_install('sentence-transformers', 'torch'))

@app.function(image=image, gpu='A10G', timeout=600)
def embed(texts: list[str]) -> list[list[float]]:
    # Imported inside the function: the package lives in the container image,
    # not necessarily on the local machine.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('BAAI/bge-m3')
    return model.encode(texts).tolist()

@app.local_entrypoint()
def main():
    print(len(embed.remote(['hello world'])))   # runs on an A10G in the cloud
```

Run modal run script.py and the function executes on a cloud GPU; modal deploy keeps the app deployed so it can be called on demand. The container image, dependencies, GPU type, and scaling policy all live next to the function they serve: infrastructure-as-actual-code.

What it's genuinely good for

Bursty GPU jobs: nightly embedding backfills, batch transcription/OCR, periodic fine-tuning runs — pay for minutes, idle at zero.

Model inference endpoints with spiky traffic: @app.function + web endpoint decorator gives you an autoscaling HTTPS service; scale-to-zero means a demo or internal tool costs nothing while unused.

Parallel fan-out: fn.map(items) distributes across many containers — the "process 10K documents on 50 GPUs for 10 minutes" job that's miserable to schedule anywhere else.

Scheduled AI pipelines: cron decorators turn the enrichment/dedup/report jobs into deployed schedules with logs.
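The fan-out example above ("10K documents on 50 GPUs for 10 minutes") implies a per-document time budget and a fixed amount of billed GPU time. A minimal sketch of that arithmetic in plain Python follows. The function name fanout_plan and the 1.10 USD per GPU-hour rate are illustrative placeholders, not Modal APIs or Modal prices.

```python
import math

def fanout_plan(n_items: int, n_containers: int, wall_minutes: float,
                usd_per_gpu_hour: float) -> dict:
    """Back-of-envelope numbers for a parallel fan-out job."""
    # Each container takes an equal share of the items (rounded up).
    items_per_container = math.ceil(n_items / n_containers)
    # Time budget per item so every container finishes inside the window.
    seconds_per_item = wall_minutes * 60 / items_per_container
    # Billed GPU time is containers x wall-clock time.
    gpu_minutes = n_containers * wall_minutes
    return {
        "items_per_container": items_per_container,
        "seconds_per_item": seconds_per_item,
        "gpu_minutes": gpu_minutes,
        "usd": gpu_minutes / 60 * usd_per_gpu_hour,
    }

# Hypothetical rate of 1.10 USD per GPU-hour; substitute the current price.
plan = fanout_plan(n_items=10_000, n_containers=50, wall_minutes=10,
                   usd_per_gpu_hour=1.10)
print(plan)
```

If a single document takes longer than the per-item budget, the job needs either more containers or a longer window; the billed GPU-minutes stay roughly the same either way.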

Production knobs that matter

Cold starts are the tax for scale-to-zero. Mitigations: min_containers=1 (keep_warm=1 in older SDK releases) for latency-sensitive endpoints, which trades idle cost back for latency; memory snapshots for faster Python startup; and Volumes to cache model weights, so a cold start does not mean re-downloading 15 GB.

Volumes and Secrets: persistent storage for weights/datasets (modal.Volume), env-style secrets management built in — the two primitives that make real apps possible.

GPU selection: T4, L4, or A10G for small models and embeddings; A100 or H100 tiers for serious inference and training. Right-sizing here is your bill: start small, profile, move up.

Concurrency control: per-function max_containers / batching settings prevent a queue spike from provisioning 200 GPUs of surprise.
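For the GPU selection point, one rough first filter is whether the model weights fit in GPU memory at all. The sketch below assumes 2 bytes per parameter (fp16/bf16) and a 1.3x headroom factor for activations and caches. The headroom value and the helper names are assumptions for illustration; the memory figures are the cards' published sizes (the A100 also ships in a 40 GB variant).

```python
# Memory per card in GB, from published hardware specs.
GPU_MEMORY_GB = {"T4": 16, "L4": 24, "A10G": 24, "A100-80GB": 80, "H100": 80}

def weights_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for weights alone: 1e9 params x bytes / 1e9 bytes per GB."""
    return n_params_billion * bytes_per_param

def gpus_that_fit(n_params_billion: float, bytes_per_param: int = 2,
                  headroom: float = 1.3) -> list[str]:
    """GPUs whose memory covers weights x headroom, smallest memory first."""
    need = weights_gb(n_params_billion, bytes_per_param) * headroom
    fits = [(mem, name) for name, mem in GPU_MEMORY_GB.items() if mem >= need]
    return [name for mem, name in sorted(fits)]

print(gpus_that_fit(0.5))   # small embedding-sized model
print(gpus_that_fit(7))     # 7B model in fp16
print(gpus_that_fit(70))    # 70B model in fp16: no single card in this table
```

This only rules cards out. Whether a card that fits is fast enough is still a profiling question, which is why the advice is to start small and move up.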

Where Modal is the wrong tool

Steady high-volume LLM serving: at sustained utilization, reserved capacity or self-managed vLLM beats per-second serverless pricing, because the serverless premium buys elasticity you're not using. The crossover math is the same buy-vs-rent calculation as for hosted open-model APIs (which, for *standard* models, are simpler than running anything yourself).

Just calling hosted LLM APIs: no GPUs needed — plain serverless/containers are cheaper and simpler.

Strict data-residency/VPC mandates: it's a managed cloud, so check the current enterprise and network options against your compliance requirements.
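The crossover math mentioned for steady LLM serving can be sketched as a break-even utilization: serverless bills only busy time, reserved capacity bills every hour. The rates below (4.00 and 2.00 USD per hour) are hypothetical placeholders, and the model deliberately ignores cold starts, warm-container costs, and the engineering time that self-managed serving requires.

```python
def breakeven_utilization(serverless_usd_per_hour: float,
                          reserved_usd_per_hour: float) -> float:
    """Busy fraction at which both options cost the same.

    Serverless cost per hour = utilization x serverless rate.
    Reserved cost per hour   = reserved rate (paid whether busy or idle).
    Setting them equal gives utilization = reserved / serverless.
    """
    return reserved_usd_per_hour / serverless_usd_per_hour

def cheaper_option(utilization: float, serverless_usd_per_hour: float,
                   reserved_usd_per_hour: float) -> str:
    """Pick the cheaper option for a given busy fraction (0.0 to 1.0)."""
    serverless_cost = utilization * serverless_usd_per_hour
    return "serverless" if serverless_cost < reserved_usd_per_hour else "reserved"

# Hypothetical rates; substitute real quotes for the same GPU class.
print(breakeven_utilization(4.00, 2.00))   # busy fraction where costs match
print(cheaper_option(0.10, 4.00, 2.00))    # bursty workload
print(cheaper_option(0.80, 4.00, 2.00))    # sustained workload
```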

Competitors in the same lane: RunPod (cheaper raw GPU-hours, less polished DX), Replicate (model-zoo-first), cloud-native serverless GPU offerings. Modal's moat is the Python DX; if your team feels infra-as-decorators is "how it should work," that's the fit signal.

FAQ

Cost predictability? Per-second billing with scale-to-zero is great for bursty loads and dangerous for runaway loops — set max_containers and budget alerts on day one.
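A container cap turns "runaway loop" from an unbounded risk into a bounded one, and the bound is simple arithmetic. The sketch below assumes a flat hypothetical rate of 1.00 USD per GPU-hour and every container running continuously; the function names are illustrative, not part of any SDK.

```python
def worst_case_usd(max_containers: int, usd_per_gpu_hour: float,
                   hours: float) -> float:
    """Upper bound on spend if every allowed container runs the whole time."""
    return max_containers * usd_per_gpu_hour * hours

def hours_until_budget(budget_usd: float, max_containers: int,
                       usd_per_gpu_hour: float) -> float:
    """How long a fully saturated runaway job takes to burn through a budget."""
    return budget_usd / (max_containers * usd_per_gpu_hour)

# A loop that goes unnoticed over a 72-hour weekend, at a hypothetical 1.00 USD/hour:
print(worst_case_usd(max_containers=4, usd_per_gpu_hour=1.00, hours=72))
print(worst_case_usd(max_containers=200, usd_per_gpu_hour=1.00, hours=72))
print(hours_until_budget(budget_usd=500, max_containers=4, usd_per_gpu_hour=1.00))
```

Comparing the capped and uncapped figures is a quick way to decide where to set both the cap and the alert threshold.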

Can I serve an open LLM on it? Yes: vLLM inside a Modal function with a web endpoint is a documented pattern; keep weights on a Volume and set min_containers (formerly keep_warm) for latency. At sustained load, revisit the crossover math above.
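Keeping a container warm for latency has a price that is easy to estimate up front. This sketch assumes a hypothetical 1.00 USD per GPU-hour rate and about 730 hours in a month; warm_containers and busy_fraction are illustrative parameter names, not SDK arguments.

```python
def warm_idle_usd_per_month(warm_containers: int, usd_per_gpu_hour: float,
                            busy_fraction: float = 0.0,
                            hours_per_month: float = 730) -> float:
    """Monthly cost of warm containers for the hours they sit idle.

    busy_fraction is the share of time the warm containers serve real traffic;
    those hours would be billed with or without the warm setting.
    """
    idle_hours = hours_per_month * (1 - busy_fraction)
    return warm_containers * usd_per_gpu_hour * idle_hours

# One warm container, hypothetical 1.00 USD/hour:
print(warm_idle_usd_per_month(1, 1.00))                      # never used
print(warm_idle_usd_per_month(1, 1.00, busy_fraction=0.25))  # busy a quarter of the time
```

If that monthly figure is more than the latency is worth, scale-to-zero with weights cached on a Volume is the cheaper compromise.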

Local development? The same function can be called locally with .local() or remotely with .remote(); the dev loop is genuinely the selling point.

*Last updated: June 2026. GPU lineup and pricing move — verify on modal.com.*
