Llama 3.3 70B (2025-12): What's New and How to Use It
Complete guide to the latest Llama 3.3 70B capabilities: best open-source performance, multilingual
Llama 3.3 70B: What It Changed and How to Run It
Llama 3.3 70B (Meta, December 2024) mattered for one headline reason: it delivered roughly the instruction-following quality of the giant Llama 3.1 405B in a 70B package — per Meta's own announcement — making "frontier-ish quality on hardware you can actually rent" real. It became the default serious open-weights choice of its generation, and understanding it still frames how to evaluate today's open models.
What it actually changed
Running it: the three realistic paths
1. Local via Ollama (evaluation, light use):
bash
ollama run llama3.3
Quantized (Q4) needs ~40-43GB — a 64GB Mac or dual-24GB-GPU box runs it; 32GB machines should look at smaller models instead (local model comparison).
2. Hosted open-model APIs (production without GPUs): Together, Fireworks, Groq and peers serve Llama models at per-token prices far below closed flagships — OpenAI-compatible endpoints, so it's a base-URL swap:
python
from openai import OpenAI
client = OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key='...')
resp = client.chat.completions.create(
model='accounts/fireworks/models/llama-v3p3-70b-instruct', # provider-specific ID
messages=[{'role': 'user', 'content': 'Classify this support ticket...'}],
)
(Provider trade-offs: Fireworks production guide.)
3. Self-hosted vLLM (volume + data control): an 8-GPU node with tensor parallelism, or FP8/AWQ quantization to shrink the footprint — sizing math and throughput levers in our inference optimization and KV cache guides.
What it was (and wasn't) good for
Strong: instruction following, summarization, extraction, RAG answer synthesis, multilingual chat, fine-tuning base for domain models (the 70B + LoRA recipe — guide — became the standard enterprise pattern).
Weak vs closed flagships of its time: hardest math/reasoning chains, agentic tool-use reliability over long horizons, and anything needing vision.
Where it stands now
Meta has since moved to the Llama 4 family (MoE architecture, multimodal), and open-weights competition intensified — Qwen, DeepSeek, and Kimi K2 all ship open models that beat Llama 3.3 on various axes. But 3.3-70B remains widely deployed in 2026 because it's a known quantity: stable behavior, huge tooling/fine-tune ecosystem, and runs everywhere. The evaluation playbook it taught — judge open models by deployment-cost-per-quality, license terms, and ecosystem, not leaderboard position alone — is the durable lesson; current side-by-sides live in the model library.
FAQ
Llama 3.3 vs 3.1-405B today? Unless you're chasing the last points on hard reasoning, 3.3-70B (or a newer mid-size open model) is the rational choice — the 405B's serving cost rarely justifies the delta.
Can I use outputs to train other models? The license has specific terms on this (and on attribution) — check the current Llama license text rather than folklore.
Fine-tune 3.3 or prompt a frontier API? Decision framework in fine-tuning vs RAG: stable narrow task + volume economics → fine-tune the 70B; everything else → start with prompting.
*Last updated: June 2026. Specs per Meta's release notes; verify license and current model lineup at llama.com.*
Also available in 中文.