Modal AI Infrastructure: Complete Setup Guide
Serverless GPU infrastructure for AI workloads
Modal for AI Workloads: Complete Setup Guide
Modal is serverless GPUs with a Python-native developer experience: you decorate functions, declare their container image and GPU in code, and Modal handles provisioning, scaling (including to zero), and billing by the second. For AI teams it occupies a sweet spot — heavier than calling hosted APIs, far lighter than managing GPU Kubernetes — and it's become a default answer for "I need GPUs sometimes, not a cluster always."
The mental model
Everything is code — no YAML, no console-clicking:
python
import modalapp = modal.App('embeddings-service')
image = (modal.Image.debian_slim()
.pip_install('sentence-transformers', 'torch'))
@app.function(image=image, gpu='A10G', timeout=600)
def embed(texts: list[str]) -> list[list[float]]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-m3')
return model.encode(texts).tolist()
@app.local_entrypoint()
def main():
print(len(embed.remote(['hello world']))) # runs on an A10G in the cloud
modal run script.py and that function executes on a cloud GPU; modal deploy makes it a persistent endpoint. The container image, dependencies, GPU type, and scaling policy all live next to the function they serve — infrastructure-as-actual-code.
What it's genuinely good for
@app.function + web endpoint decorator gives you an autoscaling HTTPS service; scale-to-zero means a demo or internal tool costs nothing while unused.fn.map(items) distributes across many containers — the "process 10K documents on 50 GPUs for 10 minutes" job that's miserable to schedule anywhere else.Production knobs that matter
keep_warm=1 for latency-sensitive endpoints (trades idle cost back), memory snapshots for faster Python startup, and Volumes to cache model weights so cold start ≠ re-download 15GB.modal.Volume), env-style secrets management built in — the two primitives that make real apps possible.max_containers / batching settings prevent a queue spike from provisioning 200 GPUs of surprise.Where Modal is the *wrong* tool
Competitors in the same lane: RunPod (cheaper raw GPU-hours, less polished DX), Replicate (model-zoo-first), cloud-native serverless GPU offerings. Modal's moat is the Python DX; if your team feels infra-as-decorators is "how it should work," that's the fit signal.
FAQ
Cost predictability? Per-second billing with scale-to-zero is great for bursty loads and dangerous for runaway loops — set max_containers and budget alerts on day one.
Can I serve an open LLM on it? Yes — vLLM inside a Modal function with a web endpoint is a documented pattern; weights on a Volume, keep_warm for latency. At sustained load, revisit the crossover math above.
Local development? The same code runs locally (.local()) vs remotely (.remote()) — the dev loop is genuinely the selling point.
*Last updated: June 2026. GPU lineup and pricing move — verify on modal.com.*
Also available in 中文.